Arnaud Bouchez - Synopse
Rewrite for Performance
From Delphi to AVX2
Welcome to
a fun/wakeup session
about performance
hashes
and assembly mystery
Arnaud Bouchez
• Open Source Founder
mORMot
SynPDF
• Delphi and FPC expert
DDD, SOA, ORM, MVC
Performance, SOLID
• Synopse Consulting
https://synopse.info
Menu du jour
• The Hash-Table Mystery
• Branches Are Evil
• SIMD Assembly: SSE2
• SIMD Assembly: AVX2
• Conclusion
The Hash-Table Mystery
mORMot is Fast
and tries to be always faster
so works hard for it
The Hash-Table Mystery
One core component
is TDynArrayHasher
= a hasher for a dynamic array
<> a hashed list
(it does not own the data)
Used e.g. by
• the TDynArray wrapper
• the TSynDictionary class
• the in-memory ORM engine
The Hash-Table Mystery
How does a Hash-Table work?
bucketindex := hash(key) mod bucketscount
for O(1) retrieval instead of O(n) manual lookup
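A minimal sketch of the bucket computation above, written in C for illustration. The names toy_hash and bucket_index are hypothetical, and FNV-1a is only a stand-in hash chosen for brevity; it is not the hash mORMot uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a: a simple stand-in hash (assumption), not crc32c or xxhash32. */
static uint32_t toy_hash(const char *key)
{
    uint32_t h = 2166136261u;            /* FNV offset basis */
    for (; *key != '\0'; key++) {
        h ^= (unsigned char)*key;
        h *= 16777619u;                  /* FNV prime */
    }
    return h;
}

/* bucketindex := hash(key) mod bucketscount */
static size_t bucket_index(const char *key, size_t bucket_count)
{
    assert(bucket_count > 0);            /* mod by zero is undefined */
    return toy_hash(key) % bucket_count;
}
```

The same key always lands in the same bucket, which is what allows a lookup to go straight to one slot instead of scanning the whole array.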
The Hash-Table Mystery
How does a Hash-Table work?
crc32c()
(hardware accelerated SSE4.2)
The Hash-Table Mystery
How does a Hash-Table work?
xxhash32()
(on non-Intel or old CPUs)
The Hash-Table Mystery
How does a Hash-Table work?
mORMot prefers indexes for efficiency
(and doesn’t store the hashcode, since crc32c is fast)
The Hash-Table Mystery
How does a Hash-Table work?
mORMot stores keys with values
within a (dynamic) array
The Hash-Table Mystery
How does a Hash-Table work?
mORMot can hash several keys
in the same (dynamic) array
The Hash-Table Mystery
How does a Hash-Table work?
It is easy to insert a new item
if we properly handle hash collisions
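A possible sketch of insertion into a table that stores only indexes into an external array, in C. Linear probing is an assumption here (the slides do not say which collision strategy mORMot uses), and index_table, table_add, table_find, SLOT_COUNT and the FNV-1a toy_hash are all hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SLOT_COUNT 8            /* hypothetical fixed table size */
#define EMPTY_SLOT (-1)

typedef struct {
    const char **keys;          /* the data stays in the caller's array */
    int32_t slots[SLOT_COUNT];  /* each slot holds an index into keys[] */
} index_table;

/* FNV-1a stand-in hash (assumption), not crc32c or xxhash32. */
static uint32_t toy_hash(const char *key)
{
    uint32_t h = 2166136261u;
    for (; *key != '\0'; key++) {
        h ^= (unsigned char)*key;
        h *= 16777619u;
    }
    return h;
}

static void table_init(index_table *t, const char **keys)
{
    t->keys = keys;
    for (int i = 0; i < SLOT_COUNT; i++)
        t->slots[i] = EMPTY_SLOT;
}

/* Returns the slot holding `key`, or the empty slot where it belongs.
   Precondition: at least one slot is empty, else the probe never ends. */
static int find_slot(const index_table *t, const char *key)
{
    int slot = (int)(toy_hash(key) % SLOT_COUNT);
    while (t->slots[slot] != EMPTY_SLOT &&
           strcmp(t->keys[t->slots[slot]], key) != 0)
        slot = (slot + 1) % SLOT_COUNT;   /* collision: try the next slot */
    return slot;
}

/* Registers keys[index] in the table. */
static void table_add(index_table *t, int32_t index)
{
    assert(index >= 0);
    t->slots[find_slot(t, t->keys[index])] = index;
}

/* Returns the index of `key` in keys[], or -1 when absent. */
static int32_t table_find(const index_table *t, const char *key)
{
    return t->slots[find_slot(t, key)];
}
```

Because the slots hold positions rather than the items themselves, the table never owns the data; it only has to stay in sync with the array.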
The Hash-Table Mystery
How does a Hash-Table work?
The hard part is deletion:
you cannot just reset the slot,
since the array indexes of later items have shifted
The Hash-Table Mystery
In case of deletion, we may:
1. Re-compute the whole hash table
   What mORMot did for years. Not too bad in practice.
2. Adjust the indexes
   Brute force O(n) algorithm.
3. Use another algorithm
   More complex, and usually stores the data.
The Hash-Table Mystery
On Deletion, Adjust the Indexes
Brute force O(n) algorithm
Seems simple, lean and efficient.
Let’s try deleting 1/128th of 200,000 items !
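The brute-force adjustment can be sketched as follows, in C. The original listing did not survive in this transcript, so adjust_naive and its parameter names are hypothetical; the logic follows the "if … then dec(P[i])" description given later in the deck.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* After item `deleted` is removed from the data array, every later item
   moves down by one, so each stored index above it must be decremented.
   The slot that referenced `deleted` itself is assumed to be reset
   elsewhere. */
static void adjust_naive(int32_t *indexes, size_t count, int32_t deleted)
{
    assert(indexes != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        if (indexes[i] > deleted)      /* data-dependent branch */
            indexes[i]--;
}
```

For example, with stored indexes {0, 5, 3, 7, 2} and deleted = 3, the array becomes {0, 4, 3, 6, 2}.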
The Hash-Table Mystery
On Deletion, Adjust the Indexes
Brute force O(n) algorithm
But not really fast on huge count.
23 #195075 adjust=4.27s 548.6MB/s hash=2.47ms
Why????
Branches Are Evil
Alt-F2: The Obvious Pascal → asm → CPU Flow
Branches Are Evil
Zilog Z80
nostalgic sight:
“Why would I need more than
16KB RAM on my ZX81?”
Branches Are Evil
Processors Learn to Predict Branches
Critical since Pentium 4 and its very deep pipeline
(branch prediction itself is older)
In case of misprediction,
execution pipelines need to be flushed
… just as if you needed to rewind a tape
… or as if JS needed to garbage collect
Branches Are Evil
Processors Learn to Predict Branches
Each CPU Vendor and Architecture
changes the execution plan
and even introduced Artificial Intelligence
i.e. a CPU is a very complex beast
→ don’t trust the code, nor the asm!
Branches Are Evil
Be your own CPU: Let’s Predict !
Branches Are Evil
2 is always taken, 3 is taken except the last time
and 1 is “randomly” taken… so not predictable...
Branches Are Evil
Processors Learn to Predict Branches
Source:
https://lemire.me/blog/2019/10/16/benchmarking-is-hard-processors-learn-to-predict-branches/
Branches Are Evil
Processors Learn to Predict Branches
Pseudo code:
while (howmany != 0) {
    val = random();
    if( val is an odd integer ) {
        out[index] = val;
        index += 1;
    }
    howmany--;
}
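A runnable C version of this pseudo code, as a sketch. The xorshift32 generator is an assumption standing in for random(), used so the run is reproducible; next_random and keep_odd are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* xorshift32: a small deterministic stand-in for random() (assumption). */
static uint32_t next_random(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Copies the odd values among `howmany` random draws into out[] and
   returns how many were kept. out[] must have room for `howmany` items. */
static size_t keep_odd(uint32_t *out, size_t howmany, uint32_t seed)
{
    assert(seed != 0);                 /* xorshift32 is stuck at zero */
    size_t index = 0;
    uint32_t state = seed;
    while (howmany != 0) {
        uint32_t val = next_random(&state);
        if (val & 1u) {                /* the hard-to-predict branch */
            out[index] = val;
            index += 1;
        }
        howmany--;
    }
    return index;
}
```

Since the low bit of each draw is effectively random, the `if` is taken about half of the time with no pattern for the predictor to learn on a first pass.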
Branches Are Evil
Processors Learn to Predict Branches
The more trials, the better prediction…
the CPU somehow learns from its mistakes!
Branches Are Evil
Processors Learn to Predict Branches
Perfect prediction!
Branches Are Evil
Processors Learn to Predict Branches
… but prediction has a depth
From Lemire:
“This perfect prediction on the AMD Rome
falls apart if you grow the problem
from 2000 to 10,000 values: the best
prediction goes from a 0.1% error rate
to a 33% error rate.”
Branches Are Evil
Processors Learn to Predict Branches
… but prediction has a depth
From Lemire:
“You should probably avoid benchmarking
branchy code over small problems.”
That’s why I hate microbenchmarks!
And in the Delphi world, I have seen so many!
Branches Are Evil
Branch Misprediction Hurts
if … then …
dec(P[i]) branch is taken or not taken evenly,
in an unpredictable manner
(as random as the hash function itself)
Note: unrolling doesn’t help, by definition
Branches Are Evil
What about Going Parallel?
We could divide P[] into sections, and use threads
- it should scale up to how many CPU cores we have
- but we are in a low-level library, so threads are unavailable
- there should be a better way
Branches Are Evil
Introducing a Branch-Less Loop
ord(P[count] > delete)
boolean-to-integer expression returns
either 0 (false) or 1 (true)
and has no branch
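A sketch of the branch-less loop in C, where a comparison already evaluates to 0 or 1, so it plays the role of ord(P[count] > delete). The function name adjust_branchless is hypothetical, and the original Pascal listing is not part of this transcript.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same result as the naive loop, but the comparison result (0 or 1)
   is subtracted unconditionally instead of guarding a decrement. */
static void adjust_branchless(int32_t *p, size_t count, int32_t deleted)
{
    assert(p != NULL || count == 0);
    while (count > 0) {                /* loop branch: easy to predict */
        count--;
        p[count] -= (p[count] > deleted);
    }
}
```

Every item is now written back, even when it is unchanged; whether the compiler really emits branch-free code for this should be checked in the generated asm.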
Branches Are Evil
Introducing a Branch-Less Loop
FACT: it is actually faster to execute
dec(P[count], 0);
than to handle a mispredicted branch…
(i.e. execute nothing)
Branches Are Evil
Introducing a Branch-Less Loop
while count > 0 is very likely to loop
therefore easy to predict
(by static prediction convention,
a backward jump is estimated as most probable)
Branches Are Evil
Introducing a Branch-Less Loop
ord(P[count] > delete)
compiles to very efficient asm
(branchless setl opcode)
Branches Are Evil
Introducing a Branch-Less Loop
Here, a little unrolling (slightly) helps…
since it avoids an unlikely count <= 0 condition/branch
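A possible unrolled variant, sketched in C. The unroll factor of 4 and the name adjust_unrolled are assumptions; the deck does not give the factor in the surviving text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Processes four items per loop test, then finishes the remainder. */
static void adjust_unrolled(int32_t *p, size_t count, int32_t deleted)
{
    assert(p != NULL || count == 0);
    while (count >= 4) {               /* one loop branch per four items */
        p[0] -= (p[0] > deleted);
        p[1] -= (p[1] > deleted);
        p[2] -= (p[2] > deleted);
        p[3] -= (p[3] > deleted);
        p += 4;
        count -= 4;
    }
    while (count > 0) {                /* 0 to 3 trailing items */
        *p -= (*p > deleted);
        p++;
        count--;
    }
}
```

The gain is small because the loop test was already well predicted; unrolling only reduces how often it is evaluated.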
Branches Are Evil
Numbers Are Talking
naïve if adjust=4.27s 548.6MB/s
branchless adjust=520.85ms 4.3GB/s
We have about 8X better performance (4.27s down to 520.85ms),
in pure Pascal code!
SIMD Assembly: SSE2
Can SIMD Improve It Further?
SIMD = Single Instruction,
Multiple Data
SIMD Assembly: SSE2
Can SIMD Improve It Further?
• Data Alignment Restrictions
• Gathering/Scattering is Tricky
• Architecture Specific
• Not native to Delphi or FPC compilers
• Sometimes needs setup/clear
SIMD Assembly: SSE2
SSE2 SIMD Instructions
• Introduced by Intel in 2000 (Pentium 4)
• XMM0 to XMM7 Registers
in 32-bit mode
• XMM0 to XMM15
in x86_64 mode
SIMD Assembly: SSE2
SSE2 SIMD Instructions
• Each 128-bit XMM Register can handle
Two 64-bit Doubles or Integers
Four 32-bit Integers
Eight 16-bit or Sixteen 8-bit Integers
SIMD Assembly: SSE2
We need to SIMD the following code:
We can identify two 4-integers = 128-bit blocks
SIMD Assembly: SSE2
1. Prepare and Align the Input
Parameters: rcx=P edx=deleted r8=count
SIMD Assembly: SSE2
2. Processing Loop
SIMD Assembly: SSE2
3. Trailing Bytes
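The three steps above can be sketched with SSE2 intrinsics in C instead of hand-written assembly. This is an illustration under assumptions, not the mORMot asm: adjust_sse2 is a hypothetical name, it handles one 128-bit block per iteration (the deck mentions two), and it falls back to the scalar loop when SSE2 is not available at compile time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>                 /* SSE2 intrinsics */
#define USE_SSE2 1
#endif

static void adjust_sse2(int32_t *p, size_t count, int32_t deleted)
{
    assert(p != NULL || count == 0);
#ifdef USE_SSE2
    /* 1. Prepare and align: broadcast `deleted` to four lanes, then
       handle single items until p sits on a 16-byte boundary. */
    const __m128i del = _mm_set1_epi32(deleted);
    while (count > 0 && ((uintptr_t)p & 15) != 0) {
        *p -= (*p > deleted);
        p++;
        count--;
    }
    /* 2. Processing loop: the compare yields -1 in each lane where
       p[i] > deleted and 0 elsewhere; adding it decrements those lanes. */
    while (count >= 4) {
        __m128i v = _mm_load_si128((const __m128i *)p);
        __m128i gt = _mm_cmpgt_epi32(v, del);
        _mm_store_si128((__m128i *)p, _mm_add_epi32(v, gt));
        p += 4;
        count -= 4;
    }
#endif
    /* 3. Trailing items (or everything, without SSE2). */
    while (count > 0) {
        *p -= (*p > deleted);
        p++;
        count--;
    }
}
```

The mask trick keeps the vector loop branch-less too: the signed compare produces all ones (that is, -1) exactly where a decrement is needed.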
SIMD Assembly: SSE2
Numbers Are Talking
naïve if adjust=4.27s 548.6MB/s
branchless adjust=520.85ms 4.3GB/s
sse2 adjust=201.53ms 11.3GB/s
We expected X4
but we got about X2.6 over the branchless loop
(pretty good, to be fair)
SIMD Assembly: SSE2
Help Needed?
https://www.agner.org/optimize/
The “Optimization Bible” (also per-CPU timing)
https://gcc.godbolt.org/
Check what best compilers do
https://www.felixcloutier.com/x86/
OpCode Reference Documentation
SIMD Assembly: AVX2
AVX2 SIMD Instructions
• AVX introduced in Sandy Bridge 2011
YMM 256-bit registers (floating point)
New VEX coding scheme
• AVX2 introduced in Haswell 2013
256-bit integer instructions
Fused Multiply-Add (FMA) ops arrived alongside, as a separate extension
SIMD Assembly: AVX2
AVX2 SIMD Instructions
• Each 256-bit YMM Register can handle
Four 64-bit Doubles or Integers
Eight 32-bit Integers
Sixteen 16-bit or Thirty-two 8-bit Integers
SIMD Assembly: AVX2
AVX2 SIMD Instructions
• Before using them:
Check the CPUID flag
Ensure the OS is AVX2-Aware
• AVX2 is Supported in FPC asm
• AVX2 is Not Supported in Delphi asm
SIMD Assembly: AVX2
SSE2 Processing Loop
SIMD Assembly: AVX2
New AVX2 Processing Loop
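A sketch of what the AVX2 loop may look like with C intrinsics, assuming GCC or Clang on x86. The names adjust_avx2 and adjust_avx2_blocks are hypothetical. The CPU check relies on the compiler builtin __builtin_cpu_supports rather than a hand-written CPUID query, and unaligned loads are used to keep the example short, which differs from the aligned approach of the SSE2 steps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>                 /* AVX2 intrinsics */
#define HAVE_AVX2_PATH 1

/* Compiled for AVX2 whatever the global flags are; it must only be
   called after the runtime check below. Returns the items processed. */
static __attribute__((target("avx2")))
size_t adjust_avx2_blocks(int32_t *p, size_t count, int32_t deleted)
{
    const __m256i del = _mm256_set1_epi32(deleted);
    size_t done = 0;
    while (count - done >= 8) {        /* eight 32-bit items per YMM */
        __m256i v = _mm256_loadu_si256((const __m256i *)(p + done));
        __m256i gt = _mm256_cmpgt_epi32(v, del);   /* -1 where v > deleted */
        _mm256_storeu_si256((__m256i *)(p + done), _mm256_add_epi32(v, gt));
        done += 8;
    }
    return done;
}
#endif

static void adjust_avx2(int32_t *p, size_t count, int32_t deleted)
{
    assert(p != NULL || count == 0);
    size_t i = 0;
#ifdef HAVE_AVX2_PATH
    if (__builtin_cpu_supports("avx2"))            /* runtime CPU check */
        i = adjust_avx2_blocks(p, count, deleted);
#endif
    for (; i < count; i++)             /* trailing items, or scalar fallback */
        p[i] -= (p[i] > deleted);
}
```

Apart from the register width, the loop body is the same as the SSE2 one, which is consistent with the modest gain reported next: the work per item is tiny, so data movement dominates.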
SIMD Assembly: AVX2
Numbers Are Talking
naïve if adjust=4.27s 548.6MB/s
branchless adjust=520.85ms 4.3GB/s
sse2 adjust=201.53ms 11.3GB/s
avx2 adjust=161.73ms 14.1GB/s
We got only about 25% better numbers than SSE2
→ We saturated the CPU bandwidth
Conclusion
• On Deletion, TDynArrayHasher
is not a bottleneck any more
• The TDynArray.Delete data move
takes most of the time now
• We have a nice pure-pascal version
Conclusion
• Branches are Evil
• Never Trust Micro Benchmarks
• Unrolling is no magic
• Branchless is magic: about 8 X faster
• SIMD is worth it if really needed
for another 3 X boost
From Delphi to AVX2
Questions?
No Marmots Were Harmed in the Making of This Session
Ekon24 from Delphi to AVX2

  • 1. Arnaud Bouchez - Synopse Rewrite for Performance From Delphi to AVX2
  • 2. Welcome to a fun/wakeup session about performance hashes and assembly mystery
  • 3. Arnaud Bouchez • Open SourceFounder mORMot SynPDF • Delphiand FPC expert DDD, SOA, ORM, MVC Performance,SOLID • SynopseConsulting https://synopse.info
  • 4. Menu du jour • The Hash-Table Mystery • Branches Are Evil • SIMD Assembly: SSE2 • SIMD Assembly: AVX2 • Conclusion
  • 5. • The Hash-Table Mystery • Branches Are Evil • SIMD Assembly: SSE2 • SIMD Assembly: AVX2 • Conclusion
  • 7. The Hash-Table Mystery mORMot is Fast and tries to be always faster
  • 8. The Hash-Table Mystery mORMot is Fast and tries to be always faster so works hard for it
  • 9. The Hash-Table Mystery One core component is TDynArrayHasher = a hasher for a dynamic array
  • 10. The Hash-Table Mystery One core component is TDynArrayHasher = a hasher for a dynamic array <> a hashed list (it does not own the data)
  • 11. The Hash-Table Mystery One core component is TDynArrayHasher = a hasher for a dynamic array Used e.g. by the TDynArray wrapper the TSynDictionary class the in-memory ORM engine
  • 12. The Hash-Table Mystery How does a Hash-Table work? bucketindex := hash(key) mod bucketscount for O(1) retrieval instead of O(n) manual lookup
  • 13. The Hash-Table Mystery How does a Hash-Table work? crc32c() (hardware accelerated SSE4.2)
  • 14. The Hash-Table Mystery How does a Hash-Table work? xxhash32() (on non-Intel or old CPUs)
  • 15. The Hash-Table Mystery How does a Hash-Table work? mORMot prefers indexes for efficiency (and don’t store the hashcode since crc32c is fast)
  • 16. The Hash-Table Mystery How does a Hash-Table work? mORMot stores keys with values within a (dynamic) array
  • 17. The Hash-Table Mystery How does a Hash-Table work? mORMot can hash several keys in the same (dynamic) array
  • 18. The Hash-Table Mystery How does a Hash-Table work? It is easy to insert a new item
  • 19. The Hash-Table Mystery How does a Hash-Table work? It is easy to insert a new item if we handle properly hash collision
  • 20. The Hash-Table Mystery How does a Hash-Table work? the Hard Thing is for Deletion you can not just reset the slot since indexes changed
  • 21. The Hash-Table Mystery In case of deletion, we may: 1. Re-compute the whole hash table 2. Adjust the indexes 3. Use other algorithm
  • 22. The Hash-Table Mystery In case of deletion, we may: 1. Re-compute the whole hash table What mORMot did for years. Not too bad in practice. 2. Adjust the indexes Brute force O(n) algorithm. 3. Use other algorithm More complex, and usually stores the data.
  • 23. The Hash-Table Mystery In case of deletion, we may: 1. Re-compute the whole hash table What mORMot did for years. Not too bad in practice. 2. Adjust the indexes Brute force O(n) algorithm. 3. Use other algorithm More complex, and usually stores the data.
  • 24.
  • 25. The Hash-Table Mystery On Deletion, Adjust the Indexes Brute force O(n) algorithm Seems simple, lean and efficient.
  • 26. The Hash-Table Mystery On Deletion, Adjust the Indexes Brute force O(n) algorithm Seems simple, lean and efficient. Let’s try deleting 1/128th of 200,000 items !
  • 27. The Hash-Table Mystery On Deletion, Adjust the Indexes Brute force O(n) algorithm But not really fast on huge count. 23 #195075 adjust=4.27s 548.6MB/s hash=2.47ms Why????
  • 28. • The Hash-Table Mystery • Branches Are Evil • SIMD Assembly: SSE2 • SIMD Assembly: AVX2 • Conclusion
  • 29. Branches Are Evil Alt-F2 : The Obvious Pascal  asm  CPU Flow
  • 30. Branches Are Evil Alt-F2 : The Obvious Pascal  asm  CPU Flow
  • 31. Branches Are Evil Zilog Z80 nostalgic sight: “Why would I need more than 16KB RAM on my ZX81?”
  • 33. Branches Are Evil Processors Learn to Predict Branches Since Pentium 4 In case of misprediction, execution pipelines need to be flushed … just as if you needed to rewind a tape
  • 34. Branches Are Evil Processors Learn to Predict Branches Since Pentium 4 In case of misprediction, execution pipelines need to be flushed … just as if JS needed to garbage collect
  • 35. Branches Are Evil Processors Learn to Predict Branches Each CPU vendor and architecture changes the execution plan, and some even introduced Artificial Intelligence, i.e. a CPU is a very complex beast → don’t trust the code, nor the asm!
  • 36. Branches Are Evil Be your own CPU: Let’s Predict !
  • 37. Branches Are Evil Branch 2 is always taken, branch 3 is taken every time but the last, and branch 1 is “randomly” taken… so not predictable.
  • 38. Branches Are Evil Processors Learn to Predict Branches Source: https://lemire.me/blog/2019/10/16/benchmarking-is-hard-processors-learn-to-predict-branches/
  • 39. Branches Are Evil Processors Learn to Predict Branches Pseudo code:
        while (howmany != 0) {
            val = random();
            if (val is an odd integer) {
                out[index] = val;
                index += 1;
            }
            howmany--;
        }
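A runnable C version of this pseudo code might look like the sketch below. It assumes a small xorshift32 generator in place of `random()`, so that the same seed replays the same sequence of branch outcomes on every trial; `next_random` and `keep_odd` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* xorshift32: tiny deterministic generator standing in for random() */
static uint32_t next_random(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Copy the odd values among `howmany` random draws into out[]; return how many were kept. */
static size_t keep_odd(uint32_t *out, size_t howmany, uint32_t seed) {
    assert(out != NULL || howmany == 0);
    size_t index = 0;
    uint32_t state = seed ? seed : 1; /* xorshift must not start from zero */
    while (howmany != 0) {
        uint32_t val = next_random(&state);
        if (val & 1) { /* the hard-to-predict branch */
            out[index] = val;
            index += 1;
        }
        howmany--;
    }
    return index;
}
```

Calling `keep_odd` repeatedly with the same seed and a small `howmany` gives the predictor a chance to memorize the pattern, which is the effect the following slides discuss.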
  • 40. Branches Are Evil Processors Learn to Predict Branches The more trials, the better prediction… the CPU somehow learns from its mistakes!
  • 41. Branches Are Evil Processors Learn to Predict Branches
  • 42. Branches Are Evil Processors Learn to Predict Branches Perfect prediction! 
  • 43. Branches Are Evil Processors Learn to Predict Branches … but prediction has a depth From Lemire: “This perfect prediction on the AMD Rome falls apart if you grow the problem from 2000 to 10,000 values: the best prediction goes from a 0.1% error rate to a 33% error rate.” 
  • 44. Branches Are Evil Processors Learn to Predict Branches … but prediction has a depth. From Lemire: “You should probably avoid benchmarking branchy code over small problems.”
  • 45. Branches Are Evil Processors Learn to Predict Branches … but prediction has a depth. From Lemire: “You should probably avoid benchmarking branchy code over small problems.” That’s why I hate microbenchmarks! And in the Delphi world, I have seen so many!
  • 46. Branches Are Evil Branch Misprediction Hurts if … then … dec(P[i]) branch is taken or not taken evenly, in an unpredictable manner (as random as the hash function itself)
  • 47. Branches Are Evil Branch Misprediction Hurts if … then … dec(P[i]) branch is taken or not taken evenly, in an unpredictable manner Note: unrolling doesn’t help, by definition
  • 48. Branches Are Evil What about Going Parallel? We could divide P[] into sections, and use threads - it should scale up to how many CPU cores we have - but we are in a low-level library, so threads are unavailable - there should be a better way
  • 49. Branches Are Evil Introducing a Branch-Less Loop
  • 50. Branches Are Evil Introducing a Branch-Less Loop ord(P[count] > delete) boolean-to-integer expression returns either 0 (false) or 1 (true) and has no branch
  • 51. Branches Are Evil Introducing a Branch-Less Loop FACT: it is actually faster to execute dec(P[count], 0); than to handle a mispredicted branch… (i.e. execute nothing)
  • 52. Branches Are Evil Introducing a Branch-Less Loop while count > 0 is very likely to loop, therefore easy to predict (by CPU convention, an “upper jump”, i.e. a backward jump to the top of the loop, is estimated as most probable)
  • 53. Branches Are Evil Introducing a Branch-Less Loop ord(P[count] > delete) compiles to very efficient asm (branchless setl opcode)
  • 54. Branches Are Evil Introducing a Branch-Less Loop Here, a little unrolling (slightly) helps… since it avoids an unlikely count <= 0 condition/branch
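The branch-less idea from the last few slides can be sketched in C as follows. In C a comparison already evaluates to 0 or 1, so it plays the role of Pascal's `ord(P[count] > delete)`. The function name `adjust_branchless` and the unroll factor of 4 are assumptions for illustration, not the author's exact code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decrement every index above `deleted` without a data-dependent branch. */
static void adjust_branchless(int32_t *P, size_t count, int32_t deleted) {
    assert(P != NULL || count == 0);
    size_t i = 0;
    /* unrolled by 4: fewer loop-condition checks per processed item */
    for (; i + 4 <= count; i += 4) {
        P[i]     -= (P[i]     > deleted); /* subtracts 1 or 0 */
        P[i + 1] -= (P[i + 1] > deleted);
        P[i + 2] -= (P[i + 2] > deleted);
        P[i + 3] -= (P[i + 3] > deleted);
    }
    /* remaining 0 to 3 items */
    for (; i < count; i++)
        P[i] -= (P[i] > deleted);
}
```

The only branches left are the loop conditions, which are almost always taken and therefore easy to predict.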
  • 55. Branches Are Evil Numbers Are Talking naïve if adjust=4.27s 548.6MB/s branchless adjust=520.85ms 4.3GB/s We have about 8X better performance (4.27s down to 520.85ms), in pure pascal code!
  • 56. • The Hash-Table Mystery • Branches Are Evil • SIMD Assembly: SSE2 • SIMD Assembly: AVX2 • Conclusion
  • 57. SIMD Assembly: SSE2 Can SIMD Improve It Further? SIMD = Single Instruction, Multiple Data
  • 58. SIMD Assembly: SSE2 Can SIMD Improve It Further? • Data Alignment Restrictions • Gathering/Scattering is Tricky • Architecture Specific • Not native to Delphi or FPC compilers • Sometimes needs setup/clear
  • 59. SIMD Assembly: SSE2 SSE2 SIMD Instructions • Introduced by Intel in 2000 (Pentium 4) • XMM0 to XMM7 Registers in 32-bit mode • XMM0 to XMM15 in x86_64 mode
  • 60. SIMD Assembly: SSE2 SSE2 SIMD Instructions • Each 128-bit XMM Register can handle Two 64-bit Doubles or Integers Four 32-bit Integers Eight 16-bit or Sixteen 8-bit Integers
  • 61. SIMD Assembly: SSE2 SSE2 SIMD Instructions
  • 62. SIMD Assembly: SSE2 We need to SIMD the following code:
  • 63. SIMD Assembly: SSE2 We need to SIMD the following code: We can identify two 4-integers = 128-bit blocks
  • 64. SIMD Assembly: SSE2 1. Prepare and Align the Input Parameters: rcx=P edx=deleted r8=count
  • 65. SIMD Assembly: SSE2 2. Processing Loop
  • 66. SIMD Assembly: SSE2 3. Trailing Bytes
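The asm listings for these three steps are images and did not survive in the transcript. As an approximation, the same structure can be written with SSE2 compiler intrinsics in C. This sketch assumes unaligned loads instead of the alignment prologue of step 1, processes two 4-integer blocks per iteration, and falls back to the scalar loop when SSE2 is not available at compile time; `adjust_sse2` and `HAVE_SSE2` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

static void adjust_sse2(int32_t *P, size_t count, int32_t deleted) {
    assert(P != NULL || count == 0);
    size_t i = 0;
#ifdef HAVE_SSE2
    /* broadcast `deleted` into the four 32-bit lanes of one XMM register */
    const __m128i del = _mm_set1_epi32(deleted);
    /* processing loop: two 128-bit blocks = 8 integers per iteration */
    for (; i + 8 <= count; i += 8) {
        __m128i a = _mm_loadu_si128((const __m128i *)(P + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(P + i + 4));
        /* cmpgt gives -1 in lanes where P > deleted, 0 elsewhere;
           adding that mask decrements exactly the matching lanes */
        a = _mm_add_epi32(a, _mm_cmpgt_epi32(a, del));
        b = _mm_add_epi32(b, _mm_cmpgt_epi32(b, del));
        _mm_storeu_si128((__m128i *)(P + i), a);
        _mm_storeu_si128((__m128i *)(P + i + 4), b);
    }
#endif
    /* trailing items, handled with the scalar branchless expression */
    for (; i < count; i++)
        P[i] -= (P[i] > deleted);
}
```

Hand-written asm with aligned accesses, as on the slides, may behave differently from what a C compiler generates from these intrinsics.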
  • 67. SIMD Assembly: SSE2 Numbers Are Talking naïve if adjust=4.27s 548.6MB/s branchless adjust=520.85ms 4.3GB/s sse2 adjust=201.53ms 11.3GB/s We expected X4 but we got a little less than X3 (pretty good, to be fair)
  • 68. SIMD Assembly: SSE2 Help Needed? https://www.agner.org/optimize/ The “Optimization Bible” (also per-CPU timing) https://gcc.godbolt.org/ Check what best compilers do https://www.felixcloutier.com/x86/ OpCode Reference Documentation
  • 69. • The Hash-Table Mystery • Branches Are Evil • SIMD Assembly: SSE2 • SIMD Assembly: AVX2 • Conclusion
  • 70. SIMD Assembly: AVX2 AVX2 SIMD Instructions • AVX introduced in Sandy Bridge (2011): new (VEX) coding scheme, new 128-bit instruction forms, YMM 256-bit registers for floating point • AVX2 introduced in Haswell (2013): 256-bit integer instructions on YMM registers; Fused Multiply-Add (FMA) ops arrived in the same generation
  • 71. SIMD Assembly: AVX2 AVX2 SIMD Instructions • Each 256-bit YMM Register can handle Four 64-bit Doubles or Integers Eight 32-bit Integers Sixteen 16-bit or Thirty-two 8-bit Integers
  • 72. SIMD Assembly: AVX2 AVX2 SIMD Instructions • Before using them: Check the CPUID flag Ensure the OS is AVX2-Aware • AVX2 is Supported in FPC asm • AVX2 is Not Supported in Delphi asm
  • 73. SIMD Assembly: AVX2 SSE2 Processing Loop
  • 74. SIMD Assembly: AVX2 New AVX2 Processing Loop
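The AVX2 listing is also an image, so here is a hedged C approximation of the idea: the same compare-and-add trick on eight 32-bit integers per YMM register, guarded by a runtime CPU check. It assumes GCC or Clang on x86_64 (for the `target("avx2")` attribute and the `__builtin_cpu_supports` builtin) and falls back to the scalar loop elsewhere; `adjust_avx2`, `adjust_avx2_blocks` and `HAVE_AVX2_DISPATCH` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
#include <immintrin.h>
#define HAVE_AVX2_DISPATCH 1

/* Compiled with AVX2 enabled for this function only; returns items processed. */
__attribute__((target("avx2")))
static size_t adjust_avx2_blocks(int32_t *P, size_t count, int32_t deleted) {
    const __m256i del = _mm256_set1_epi32(deleted);
    size_t i = 0;
    for (; i + 8 <= count; i += 8) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(P + i));
        /* mask is -1 where P > deleted, so the add decrements those lanes */
        v = _mm256_add_epi32(v, _mm256_cmpgt_epi32(v, del));
        _mm256_storeu_si256((__m256i *)(P + i), v);
    }
    return i;
}
#endif

static void adjust_avx2(int32_t *P, size_t count, int32_t deleted) {
    assert(P != NULL || count == 0);
    size_t i = 0;
#ifdef HAVE_AVX2_DISPATCH
    /* runtime check before touching any AVX2 instruction */
    if (__builtin_cpu_supports("avx2"))
        i = adjust_avx2_blocks(P, count, deleted);
#endif
    /* trailing items, or everything when AVX2 is unavailable */
    for (; i < count; i++)
        P[i] -= (P[i] > deleted);
}
```

Keeping the AVX2 body in its own function lets the rest of the program be compiled without AVX2 flags, which loosely mirrors the "check before using" rule from the earlier slide.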
  • 75. SIMD Assembly: AVX2 Numbers Are Talking naïve if adjust=4.27s 548.6MB/s branchless adjust=520.85ms 4.3GB/s sse2 adjust=201.53ms 11.3GB/s avx2 adjust=161.73ms 14.1GB/s We got only about 25% better numbers (11.3GB/s to 14.1GB/s). We saturated the CPU bandwidth.
  • 76. • The Hash-Table Mystery • Branches Are Evil • SIMD Assembly: SSE2 • SIMD Assembly: AVX2 • Conclusion
  • 77. Conclusion • On Deletion, TDynArrayHasher is not a bottleneck any more • The TDynArray.Delete data move takes most time now • We have a nice pure-pascal version
  • 78. Conclusion • Branches are Evil • Never Trust Micro Benchmarks • Unrolling is no magic • Branchless is magic: about 8 X faster • SIMD is worth it if really needed for another 3 X boost
  • 79. From Delphi to AVX2 Questions? No Marmots Were Harmed in the Making of This Session