SlideShare a Scribd company logo
Three Optimization Tips for C++

                                Andrei Alexandrescu, Ph.D.
                                   Research Scientist, Facebook
                                  andrei.alexandrescu@fb.com




© 2012- Facebook. Do not redistribute.                            1 / 33
This Talk




         • Basics
         • Reduce strength
         • Minimize array writes




© 2012- Facebook. Do not redistribute.   2 / 33
Things I Shouldn’t Even




© 2012- Facebook. Do not redistribute.       3 / 33
Today’s Computing Architectures



         • Extremely complex
         • Trade reproducible performance for average
           speed
         • Interrupts, multiprocessing are the norm
         • Dynamic frequency control is becoming
           common
         • Virtually impossible to get identical timings
           for experiments




© 2012- Facebook. Do not redistribute.                     4 / 33
Intuition


         • Ignores aspects of a complex reality
         • Makes narrow/obsolete/wrong assumptions




© 2012- Facebook. Do not redistribute.               5 / 33
Intuition


         • Ignores aspects of a complex reality
         • Makes narrow/obsolete/wrong assumptions


         • “Fewer instructions = faster code”




© 2012- Facebook. Do not redistribute.               5 / 33
Intuition


         • Ignores aspects of a complex reality
         • Makes narrow/obsolete/wrong assumptions


         • “Fewer instructions = faster code”




© 2012- Facebook. Do not redistribute.               5 / 33
Intuition


         • Ignores aspects of a complex reality
         • Makes narrow/obsolete/wrong assumptions


         • “Fewer instructions = faster code”
         • “Data is faster than computation”




© 2012- Facebook. Do not redistribute.               5 / 33
Intuition


         • Ignores aspects of a complex reality
         • Makes narrow/obsolete/wrong assumptions


         • “Fewer instructions = faster code”
         • “Data is faster than computation”




© 2012- Facebook. Do not redistribute.               5 / 33
Intuition


         • Ignores aspects of a complex reality
         • Makes narrow/obsolete/wrong assumptions


         • “Fewer instructions = faster code”
         • “Data is faster than computation”
         • “Computation is faster than data”




© 2012- Facebook. Do not redistribute.               5 / 33
Intuition


         • Ignores aspects of a complex reality
         • Makes narrow/obsolete/wrong assumptions


         • “Fewer instructions = faster code”
         • “Data is faster than computation”
         • “Computation is faster than data”




© 2012- Facebook. Do not redistribute.               5 / 33
Intuition


         • Ignores aspects of a complex reality
         • Makes narrow/obsolete/wrong assumptions


         • “Fewer instructions = faster code”
         • “Data is faster than computation”
         • “Computation is faster than data”


         • The only good intuition: “I should time this.”



© 2012- Facebook. Do not redistribute.                      5 / 33
Paradox




      Measuring gives you a
      leg up on experts who
      don’t need to measure


© 2012- Facebook. Do not redistribute.   6 / 33
Common Pitfalls


         • Measuring speed of debug builds
         • Different setup for baseline and measured
             ◦ Sequencing: heap allocator
             ◦ Warmth of cache, files, databases, DNS
         • Including ancillary work in measurement
             ◦ malloc, printf common
         • Mixtures: measure ta + tb , improve ta ,
           conclude tb got improved
         • Optimize rare cases, pessimize others



© 2012- Facebook. Do not redistribute.                 7 / 33
Optimizing Rare Cases




© 2012- Facebook. Do not redistribute.   8 / 33
More generalities




         • Prefer static linking and PDC
         • Prefer 64-bit code, 32-bit data
         • Prefer (32-bit) array indexing to pointers
            ◦ Prefer a[i++] to a[++i]
         • Prefer regular memory access patterns
         • Minimize flow, avoid data dependencies




© 2012- Facebook. Do not redistribute.                  9 / 33
Storage Pecking Order



         • Use enum for integral constants
         • Use static const for other immutables
            ◦ Beware cache issues
         • Use stack for most variables
         • Globals: aliasing issues
         • thread_local slowest, use local caching
            ◦ 1 instruction in Windows, Linux
            ◦ 3-4 in OSX




© 2012- Facebook. Do not redistribute.               10 / 33
Reduce Strength




© 2012- Facebook. Do not redistribute.             11 / 33
Strength reduction



         • Speed hierarchy:
            ◦ comparisons
            ◦ (u)int add, subtract, bitops, shift
            ◦ FP add, sub (separate unit!)
            ◦ Indexed array access
            ◦ (u)int32 mul; FP mul
            ◦ FP division, remainder
            ◦ (u)int division, remainder




© 2012- Facebook. Do not redistribute.              12 / 33
Your Compiler Called




        I get it. a >>= 1 is the
           same as a /= 2.


© 2012- Facebook. Do not redistribute.   13 / 33
Integrals



         • Prefer 32-bit ints to all other sizes
            ◦ 64 bit may make some code slower
            ◦ 8, 16-bit computations use conversion to
              32 bits and back
            ◦ Use small ints in arrays
         • Prefer unsigned to signed
            ◦ Except when converting to floating point
         • “Most numbers are small”




© 2012- Facebook. Do not redistribute.                   14 / 33
Floating Point



         • Double precision as fast as single precision
         • Extended precision just a bit slower
         • Do not mix the three
         • 1-2 FP addition/subtraction units
         • 1-2 FP multiplication/division units
         • SSE accelerates throughput for certain
           computation kernels
         • ints→FPs cheap, FPs→ints expensive




© 2012- Facebook. Do not redistribute.                    15 / 33
Advice




      Design algorithms to
     use minimum operation
            strength


© 2012- Facebook. Do not redistribute.   16 / 33
Strength reduction: Example


         • Digit count in base-10 representation
       uint32_t digits10(uint64_t v) {
          uint32_t result = 0;
          do {
             ++result;
             v /= 10;
          } while (v);
          return result;
       }

         • Uses integral division extensively
            ◦ (Actually: multiplication)


© 2012- Facebook. Do not redistribute.             17 / 33
Strength reduction: Example

       uint32_t digits10(uint64_t v) {
          uint32_t result = 1;
          for (;;) {
             if (v < 10) return result;
             if (v < 100) return result + 1;
             if (v < 1000) return result + 2;
             if (v < 10000) return result + 3;
             // Skip ahead by 4 orders of magnitude
             v /= 10000U;
             result += 4;
          }
       }

         • More comparisons and additions, fewer /=
         • (This is not loop unrolling!)
© 2012- Facebook. Do not redistribute.                18 / 33
Minimize Array Writes




© 2012- Facebook. Do not redistribute.        20 / 33
Minimize Array Writes: Why?



         •   Disables enregistering
         •   A write is really a read and a write
         •   Aliasing makes things difficult
         •   Maculates the cache



         • Generally just difficult to optimize




© 2012- Facebook. Do not redistribute.              21 / 33
Minimize Array Writes


       uint32_t u64ToAsciiClassic(uint64_t value, char* dst) {
          // Write backwards.
          auto start = dst;
          do {
             *dst++ = ’0’ + (value % 10);
             value /= 10;
          } while (value != 0);
          const uint32_t result = dst - start;
          // Reverse in place.
          for (dst--; dst > start; start++, dst--) {
             std::iter_swap(dst, start);
          }
          return result;
       }




© 2012- Facebook. Do not redistribute.                           22 / 33
Minimize Array Writes
         • Gambit: make one extra pass to compute
            length
       uint32_t uint64ToAscii(uint64_t v, char *const buffer) {
          auto const result = digits10(v);
          uint32_t pos = result - 1;
          while (v >= 10) {
             auto const q = v / 10;
             auto const r = static_cast<uint32_t>(v % 10);
             buffer[pos--] = ’0’ + r;
             v = q;
          }
          assert(pos == 0);
          // Last digit is trivial to handle
          *buffer = static_cast<uint32_t>(v) + ’0’;
          return result;
       }


© 2012- Facebook. Do not redistribute.                            23 / 33
Improvements




         •   Fewer array writes
         •   Regular access patterns
         •   Fast on small numbers
         •   Data dependencies reduced




© 2012- Facebook. Do not redistribute.   24 / 33
One More Pass




         • Reformulate digits10 as search
         • Convert two digits at a time




© 2012- Facebook. Do not redistribute.      26 / 33
uint32_t         digits10(uint64_t v) {
          if (v         < P01) return 1;
          if (v         < P02) return 2;
          if (v         < P03) return 3;
          if (v         < P12) {
             if         (v < P08) {
                        if (v < P06) {
                           if (v < P04) return 4;
                           return 5 + (v < P05);
                        }
                        return 7 + (v >= P07);
                   }
                   if (v < P10) {
                      return 9 + (v >= P09);
                   }
                   return 11 + (v >= P11);
             }
             return 12 + digits10(v / P12);
       }

© 2012- Facebook. Do not redistribute.              27 / 33
unsigned u64ToAsciiTable(uint64_t value, char* dst) {
          static const char digits[201] =
             "0001020304050607080910111213141516171819"
             "2021222324252627282930313233343536373839"
             "4041424344454647484950515253545556575859"
             "6061626364656667686970717273747576777879"
             "8081828384858687888990919293949596979899";
          uint32_t const length = digits10(value);
          uint32_t next = length - 1;
          while (value >= 100) {
             auto const i = (value % 100) * 2;
             value /= 100;
             dst[next] = digits[i + 1];
             dst[next - 1] = digits[i];
             next -= 2;
          }




© 2012- Facebook. Do not redistribute.                         28 / 33
// Handle last 1-2 digits
             if (value < 10) {
                dst[next] = ’0’ + uint32_t(value);
             } else {
                auto i = uint32_t(value) * 2;
                dst[next] = digits[i + 1];
                dst[next - 1] = digits[i];
             }
             return length;
       }




© 2012- Facebook. Do not redistribute.               29 / 33
Summary




© 2012- Facebook. Do not redistribute.             32 / 33
Summary




         • You can’t improve what you can’t measure
            ◦ Pro tip: You can’t measure what you don’t
              measure
         • Reduce strength
         • Minimize array writes




© 2012- Facebook. Do not redistribute.                    33 / 33

More Related Content

What's hot

Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016
Graham Wihlidal
 
Modern C++ 프로그래머를 위한 CPP11/14 핵심
Modern C++ 프로그래머를 위한 CPP11/14 핵심Modern C++ 프로그래머를 위한 CPP11/14 핵심
Modern C++ 프로그래머를 위한 CPP11/14 핵심흥배 최
 
모델 서빙 파이프라인 구축하기
모델 서빙 파이프라인 구축하기모델 서빙 파이프라인 구축하기
모델 서빙 파이프라인 구축하기
SeongIkKim2
 
[Container Runtime Meetup] runc & User Namespaces
[Container Runtime Meetup] runc & User Namespaces[Container Runtime Meetup] runc & User Namespaces
[Container Runtime Meetup] runc & User Namespaces
Akihiro Suda
 
Hopper アーキテクチャで、変わること、変わらないこと
Hopper アーキテクチャで、変わること、変わらないことHopper アーキテクチャで、変わること、変わらないこと
Hopper アーキテクチャで、変わること、変わらないこと
NVIDIA Japan
 
Iocp 기본 구조 이해
Iocp 기본 구조 이해Iocp 기본 구조 이해
Iocp 기본 구조 이해
Nam Hyeonuk
 
[부스트캠프 Tech Talk] 신원지_Wandb Visualization
[부스트캠프 Tech Talk] 신원지_Wandb Visualization[부스트캠프 Tech Talk] 신원지_Wandb Visualization
[부스트캠프 Tech Talk] 신원지_Wandb Visualization
CONNECT FOUNDATION
 
【Unity道場スペシャル 2017京都】最適化をする前に覚えておきたい技術
【Unity道場スペシャル 2017京都】最適化をする前に覚えておきたい技術【Unity道場スペシャル 2017京都】最適化をする前に覚えておきたい技術
【Unity道場スペシャル 2017京都】最適化をする前に覚えておきたい技術
Unity Technologies Japan K.K.
 
フロントエンドの技術を刷新した話し。
フロントエンドの技術を刷新した話し。フロントエンドの技術を刷新した話し。
フロントエンドの技術を刷新した話し。
Yutaka Horikawa
 
自宅サーバ仮想化
自宅サーバ仮想化自宅サーバ仮想化
自宅サーバ仮想化
anubis_369
 
お前は PHP の歴史的な理由の数を覚えているのか
お前は PHP の歴史的な理由の数を覚えているのかお前は PHP の歴史的な理由の数を覚えているのか
お前は PHP の歴史的な理由の数を覚えているのか
Kousuke Ebihara
 
Understanding DPDK
Understanding DPDKUnderstanding DPDK
Understanding DPDK
Denys Haryachyy
 
containerdの概要と最近の機能
containerdの概要と最近の機能containerdの概要と最近の機能
containerdの概要と最近の機能
Kohei Tokunaga
 
Rhel cluster gfs_improveperformance
Rhel cluster gfs_improveperformanceRhel cluster gfs_improveperformance
Rhel cluster gfs_improveperformance
sprdd
 
Linux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsLinux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old Secrets
Brendan Gregg
 
CUDAのアセンブリ言語基礎のまとめ PTXとSASSの概説
CUDAのアセンブリ言語基礎のまとめ PTXとSASSの概説CUDAのアセンブリ言語基礎のまとめ PTXとSASSの概説
CUDAのアセンブリ言語基礎のまとめ PTXとSASSの概説
Takateru Yamagishi
 
New Ways to Find Latency in Linux Using Tracing
New Ways to Find Latency in Linux Using TracingNew Ways to Find Latency in Linux Using Tracing
New Ways to Find Latency in Linux Using Tracing
ScyllaDB
 
Elmで始めるFunctional Reactive Programming
Elmで始めるFunctional Reactive Programming Elmで始めるFunctional Reactive Programming
Elmで始めるFunctional Reactive Programming
Yasuyuki Maeda
 
あるmmapの話
あるmmapの話あるmmapの話
あるmmapの話
nullnilaki
 
The TCP/IP Stack in the Linux Kernel
The TCP/IP Stack in the Linux KernelThe TCP/IP Stack in the Linux Kernel
The TCP/IP Stack in the Linux Kernel
Divye Kapoor
 

What's hot (20)

Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016
 
Modern C++ 프로그래머를 위한 CPP11/14 핵심
Modern C++ 프로그래머를 위한 CPP11/14 핵심Modern C++ 프로그래머를 위한 CPP11/14 핵심
Modern C++ 프로그래머를 위한 CPP11/14 핵심
 
모델 서빙 파이프라인 구축하기
모델 서빙 파이프라인 구축하기모델 서빙 파이프라인 구축하기
모델 서빙 파이프라인 구축하기
 
[Container Runtime Meetup] runc & User Namespaces
[Container Runtime Meetup] runc & User Namespaces[Container Runtime Meetup] runc & User Namespaces
[Container Runtime Meetup] runc & User Namespaces
 
Hopper アーキテクチャで、変わること、変わらないこと
Hopper アーキテクチャで、変わること、変わらないことHopper アーキテクチャで、変わること、変わらないこと
Hopper アーキテクチャで、変わること、変わらないこと
 
Iocp 기본 구조 이해
Iocp 기본 구조 이해Iocp 기본 구조 이해
Iocp 기본 구조 이해
 
[부스트캠프 Tech Talk] 신원지_Wandb Visualization
[부스트캠프 Tech Talk] 신원지_Wandb Visualization[부스트캠프 Tech Talk] 신원지_Wandb Visualization
[부스트캠프 Tech Talk] 신원지_Wandb Visualization
 
【Unity道場スペシャル 2017京都】最適化をする前に覚えておきたい技術
【Unity道場スペシャル 2017京都】最適化をする前に覚えておきたい技術【Unity道場スペシャル 2017京都】最適化をする前に覚えておきたい技術
【Unity道場スペシャル 2017京都】最適化をする前に覚えておきたい技術
 
フロントエンドの技術を刷新した話し。
フロントエンドの技術を刷新した話し。フロントエンドの技術を刷新した話し。
フロントエンドの技術を刷新した話し。
 
自宅サーバ仮想化
自宅サーバ仮想化自宅サーバ仮想化
自宅サーバ仮想化
 
お前は PHP の歴史的な理由の数を覚えているのか
お前は PHP の歴史的な理由の数を覚えているのかお前は PHP の歴史的な理由の数を覚えているのか
お前は PHP の歴史的な理由の数を覚えているのか
 
Understanding DPDK
Understanding DPDKUnderstanding DPDK
Understanding DPDK
 
containerdの概要と最近の機能
containerdの概要と最近の機能containerdの概要と最近の機能
containerdの概要と最近の機能
 
Rhel cluster gfs_improveperformance
Rhel cluster gfs_improveperformanceRhel cluster gfs_improveperformance
Rhel cluster gfs_improveperformance
 
Linux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsLinux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old Secrets
 
CUDAのアセンブリ言語基礎のまとめ PTXとSASSの概説
CUDAのアセンブリ言語基礎のまとめ PTXとSASSの概説CUDAのアセンブリ言語基礎のまとめ PTXとSASSの概説
CUDAのアセンブリ言語基礎のまとめ PTXとSASSの概説
 
New Ways to Find Latency in Linux Using Tracing
New Ways to Find Latency in Linux Using TracingNew Ways to Find Latency in Linux Using Tracing
New Ways to Find Latency in Linux Using Tracing
 
Elmで始めるFunctional Reactive Programming
Elmで始めるFunctional Reactive Programming Elmで始めるFunctional Reactive Programming
Elmで始めるFunctional Reactive Programming
 
あるmmapの話
あるmmapの話あるmmapの話
あるmmapの話
 
The TCP/IP Stack in the Linux Kernel
The TCP/IP Stack in the Linux KernelThe TCP/IP Stack in the Linux Kernel
The TCP/IP Stack in the Linux Kernel
 

Viewers also liked

Dconf2015 d2 t4
Dconf2015 d2 t4Dconf2015 d2 t4
Dconf2015 d2 t4
Andrei Alexandrescu
 
Dconf2015 d2 t3
Dconf2015 d2 t3Dconf2015 d2 t3
Dconf2015 d2 t3
Andrei Alexandrescu
 
Solid C++ by Example
Solid C++ by ExampleSolid C++ by Example
Solid C++ by Example
Olve Maudal
 
What every C++ programmer should know about modern compilers (w/o comments, A...
What every C++ programmer should know about modern compilers (w/o comments, A...What every C++ programmer should know about modern compilers (w/o comments, A...
What every C++ programmer should know about modern compilers (w/o comments, A...
Sławomir Zborowski
 
Memory Optimization
Memory OptimizationMemory Optimization
Memory Optimization
guest3eed30
 
Stabilizer: Statistically Sound Performance Evaluation
Stabilizer: Statistically Sound Performance EvaluationStabilizer: Statistically Sound Performance Evaluation
Stabilizer: Statistically Sound Performance Evaluation
Emery Berger
 
ACCU Keynote by Andrei Alexandrescu
ACCU Keynote by Andrei AlexandrescuACCU Keynote by Andrei Alexandrescu
ACCU Keynote by Andrei Alexandrescu
Andrei Alexandrescu
 
Generic Programming Galore Using D
Generic Programming Galore Using DGeneric Programming Galore Using D
Generic Programming Galore Using D
Andrei Alexandrescu
 
Three, no, Four Cool Things About D
Three, no, Four Cool Things About DThree, no, Four Cool Things About D
Three, no, Four Cool Things About D
Andrei Alexandrescu
 
iOS 6 Exploitation 280 days later
iOS 6 Exploitation 280 days lateriOS 6 Exploitation 280 days later
iOS 6 Exploitation 280 days later
Wang Hao Lee
 
Informatica 1_1_.pdf;filename_= utf-8''informatica_(1)[1]
Informatica  1_1_.pdf;filename_= utf-8''informatica_(1)[1]Informatica  1_1_.pdf;filename_= utf-8''informatica_(1)[1]
Informatica 1_1_.pdf;filename_= utf-8''informatica_(1)[1]
CARLOS VIERA
 
Muslim rule lect_4.ppt_filename_= utf-8''muslim rule lect 4
Muslim rule lect_4.ppt_filename_= utf-8''muslim rule lect 4Muslim rule lect_4.ppt_filename_= utf-8''muslim rule lect 4
Muslim rule lect_4.ppt_filename_= utf-8''muslim rule lect 4
khair20
 
C++ on the Web: Run your big 3D game in the browser
C++ on the Web: Run your big 3D game in the browserC++ on the Web: Run your big 3D game in the browser
C++ on the Web: Run your big 3D game in the browser
Andre Weissflog
 
Deep C Programming
Deep C ProgrammingDeep C Programming
Deep C Programming
Wang Hao Lee
 
C++11
C++11C++11
C++ idioms by example (Nov 2008)
C++ idioms by example (Nov 2008)C++ idioms by example (Nov 2008)
C++ idioms by example (Nov 2008)
Olve Maudal
 
Insecure coding in C (and C++)
Insecure coding in C (and C++)Insecure coding in C (and C++)
Insecure coding in C (and C++)
Olve Maudal
 
Code Optimization
Code OptimizationCode Optimization
Code Optimization
guest9f8315
 
Code generation
Code generationCode generation
Code generation
Aparna Nayak
 

Viewers also liked (20)

Dconf2015 d2 t4
Dconf2015 d2 t4Dconf2015 d2 t4
Dconf2015 d2 t4
 
Dconf2015 d2 t3
Dconf2015 d2 t3Dconf2015 d2 t3
Dconf2015 d2 t3
 
Solid C++ by Example
Solid C++ by ExampleSolid C++ by Example
Solid C++ by Example
 
What every C++ programmer should know about modern compilers (w/o comments, A...
What every C++ programmer should know about modern compilers (w/o comments, A...What every C++ programmer should know about modern compilers (w/o comments, A...
What every C++ programmer should know about modern compilers (w/o comments, A...
 
Memory Optimization
Memory OptimizationMemory Optimization
Memory Optimization
 
Stabilizer: Statistically Sound Performance Evaluation
Stabilizer: Statistically Sound Performance EvaluationStabilizer: Statistically Sound Performance Evaluation
Stabilizer: Statistically Sound Performance Evaluation
 
ACCU Keynote by Andrei Alexandrescu
ACCU Keynote by Andrei AlexandrescuACCU Keynote by Andrei Alexandrescu
ACCU Keynote by Andrei Alexandrescu
 
Generic Programming Galore Using D
Generic Programming Galore Using DGeneric Programming Galore Using D
Generic Programming Galore Using D
 
Three, no, Four Cool Things About D
Three, no, Four Cool Things About DThree, no, Four Cool Things About D
Three, no, Four Cool Things About D
 
iOS 6 Exploitation 280 days later
iOS 6 Exploitation 280 days lateriOS 6 Exploitation 280 days later
iOS 6 Exploitation 280 days later
 
UTF-8
UTF-8UTF-8
UTF-8
 
Informatica 1_1_.pdf;filename_= utf-8''informatica_(1)[1]
Informatica  1_1_.pdf;filename_= utf-8''informatica_(1)[1]Informatica  1_1_.pdf;filename_= utf-8''informatica_(1)[1]
Informatica 1_1_.pdf;filename_= utf-8''informatica_(1)[1]
 
Muslim rule lect_4.ppt_filename_= utf-8''muslim rule lect 4
Muslim rule lect_4.ppt_filename_= utf-8''muslim rule lect 4Muslim rule lect_4.ppt_filename_= utf-8''muslim rule lect 4
Muslim rule lect_4.ppt_filename_= utf-8''muslim rule lect 4
 
C++ on the Web: Run your big 3D game in the browser
C++ on the Web: Run your big 3D game in the browserC++ on the Web: Run your big 3D game in the browser
C++ on the Web: Run your big 3D game in the browser
 
Deep C Programming
Deep C ProgrammingDeep C Programming
Deep C Programming
 
C++11
C++11C++11
C++11
 
C++ idioms by example (Nov 2008)
C++ idioms by example (Nov 2008)C++ idioms by example (Nov 2008)
C++ idioms by example (Nov 2008)
 
Insecure coding in C (and C++)
Insecure coding in C (and C++)Insecure coding in C (and C++)
Insecure coding in C (and C++)
 
Code Optimization
Code OptimizationCode Optimization
Code Optimization
 
Code generation
Code generationCode generation
Code generation
 

Similar to Three Optimization Tips for C++

Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++
Mike Acton
 
6-Query_Intro (5).pdf
6-Query_Intro (5).pdf6-Query_Intro (5).pdf
6-Query_Intro (5).pdf
JaveriaShoaib4
 
Writing Readable Code
Writing Readable CodeWriting Readable Code
Writing Readable Code
eddiehaber
 
Database , 6 Query Introduction
Database , 6 Query Introduction Database , 6 Query Introduction
Database , 6 Query Introduction
Ali Usman
 
How To Handle Your Tech Debt Better - Sean Moir
How To Handle Your Tech Debt Better - Sean MoirHow To Handle Your Tech Debt Better - Sean Moir
How To Handle Your Tech Debt Better - Sean Moir
Mike Harris
 
Kernel Recipes 2014 - Writing Code: Keep It Short, Stupid!
Kernel Recipes 2014 - Writing Code: Keep It Short, Stupid!Kernel Recipes 2014 - Writing Code: Keep It Short, Stupid!
Kernel Recipes 2014 - Writing Code: Keep It Short, Stupid!
Anne Nicolas
 
Performance #5 cpu and battery
Performance #5  cpu and batteryPerformance #5  cpu and battery
Performance #5 cpu and battery
Vitali Pekelis
 
4 colin walls - self-testing in embedded systems
4   colin walls - self-testing in embedded systems4   colin walls - self-testing in embedded systems
4 colin walls - self-testing in embedded systems
Ievgenii Katsan
 
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...
Nesreen K. Ahmed
 
Optimizing Browser Rendering
Optimizing Browser RenderingOptimizing Browser Rendering
Optimizing Browser Rendering
michael.labriola
 
Disaster Recovery with MySQL and Tungsten
Disaster Recovery with MySQL and TungstenDisaster Recovery with MySQL and Tungsten
Disaster Recovery with MySQL and Tungsten
Jeff Mace
 
Tales from the Field
Tales from the FieldTales from the Field
Tales from the Field
MongoDB
 
Where Did All These Cycles Go?
Where Did All These Cycles Go?Where Did All These Cycles Go?
Where Did All These Cycles Go?
ScyllaDB
 
SOLID, DRY, SLAP design principles
SOLID, DRY, SLAP design principlesSOLID, DRY, SLAP design principles
SOLID, DRY, SLAP design principles
Sergey Karpushin
 
Cassandra introduction mars jug
Cassandra introduction mars jugCassandra introduction mars jug
Cassandra introduction mars jug
Duyhai Doan
 
Clean code, Feb 2012
Clean code, Feb 2012Clean code, Feb 2012
Clean code, Feb 2012
cobyst
 
Pragmatic Performance from NDC Oslo 2019
Pragmatic Performance from NDC Oslo 2019Pragmatic Performance from NDC Oslo 2019
Pragmatic Performance from NDC Oslo 2019
David Wengier
 
Interactive DSML Design
Interactive DSML DesignInteractive DSML Design
Interactive DSML Design
Andriy Levytskyy
 
BRV CTO Summit Deep Learning Talk
BRV CTO Summit Deep Learning TalkBRV CTO Summit Deep Learning Talk
BRV CTO Summit Deep Learning Talk
Doug Chang
 
Cassandra On EC2
Cassandra On EC2Cassandra On EC2
Cassandra On EC2
Matthew Dennis
 

Similar to Three Optimization Tips for C++ (20)

Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++
 
6-Query_Intro (5).pdf
6-Query_Intro (5).pdf6-Query_Intro (5).pdf
6-Query_Intro (5).pdf
 
Writing Readable Code
Writing Readable CodeWriting Readable Code
Writing Readable Code
 
Database , 6 Query Introduction
Database , 6 Query Introduction Database , 6 Query Introduction
Database , 6 Query Introduction
 
How To Handle Your Tech Debt Better - Sean Moir
How To Handle Your Tech Debt Better - Sean MoirHow To Handle Your Tech Debt Better - Sean Moir
How To Handle Your Tech Debt Better - Sean Moir
 
Kernel Recipes 2014 - Writing Code: Keep It Short, Stupid!
Kernel Recipes 2014 - Writing Code: Keep It Short, Stupid!Kernel Recipes 2014 - Writing Code: Keep It Short, Stupid!
Kernel Recipes 2014 - Writing Code: Keep It Short, Stupid!
 
Performance #5 cpu and battery
Performance #5  cpu and batteryPerformance #5  cpu and battery
Performance #5 cpu and battery
 
4 colin walls - self-testing in embedded systems
4   colin walls - self-testing in embedded systems4   colin walls - self-testing in embedded systems
4 colin walls - self-testing in embedded systems
 
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...
 
Optimizing Browser Rendering
Optimizing Browser RenderingOptimizing Browser Rendering
Optimizing Browser Rendering
 
Disaster Recovery with MySQL and Tungsten
Disaster Recovery with MySQL and TungstenDisaster Recovery with MySQL and Tungsten
Disaster Recovery with MySQL and Tungsten
 
Tales from the Field
Tales from the FieldTales from the Field
Tales from the Field
 
Where Did All These Cycles Go?
Where Did All These Cycles Go?Where Did All These Cycles Go?
Where Did All These Cycles Go?
 
SOLID, DRY, SLAP design principles
SOLID, DRY, SLAP design principlesSOLID, DRY, SLAP design principles
SOLID, DRY, SLAP design principles
 
Cassandra introduction mars jug
Cassandra introduction mars jugCassandra introduction mars jug
Cassandra introduction mars jug
 
Clean code, Feb 2012
Clean code, Feb 2012Clean code, Feb 2012
Clean code, Feb 2012
 
Pragmatic Performance from NDC Oslo 2019
Pragmatic Performance from NDC Oslo 2019Pragmatic Performance from NDC Oslo 2019
Pragmatic Performance from NDC Oslo 2019
 
Interactive DSML Design
Interactive DSML DesignInteractive DSML Design
Interactive DSML Design
 
BRV CTO Summit Deep Learning Talk
BRV CTO Summit Deep Learning TalkBRV CTO Summit Deep Learning Talk
BRV CTO Summit Deep Learning Talk
 
Cassandra On EC2
Cassandra On EC2Cassandra On EC2
Cassandra On EC2
 

Three Optimization Tips for C++

  • 1. Three Optimization Tips for C++ Andrei Alexandrescu, Ph.D. Research Scientist, Facebook andrei.alexandrescu@fb.com © 2012- Facebook. Do not redistribute. 1 / 33
  • 2. This Talk • Basics • Reduce strength • Minimize array writes © 2012- Facebook. Do not redistribute. 2 / 33
  • 3. Things I Shouldn’t Even © 2012- Facebook. Do not redistribute. 3 / 33
  • 4. Today’s Computing Architectures • Extremely complex • Trade reproducible performance for average speed • Interrupts, multiprocessing are the norm • Dynamic frequency control is becoming common • Virtually impossible to get identical timings for experiments © 2012- Facebook. Do not redistribute. 4 / 33
  • 5. Intuition • Ignores aspects of a complex reality • Makes narrow/obsolete/wrong assumptions © 2012- Facebook. Do not redistribute. 5 / 33
  • 6. Intuition • Ignores aspects of a complex reality • Makes narrow/obsolete/wrong assumptions • “Fewer instructions = faster code” © 2012- Facebook. Do not redistribute. 5 / 33
  • 7. Intuition • Ignores aspects of a complex reality • Makes narrow/obsolete/wrong assumptions • “Fewer instructions = faster code” © 2012- Facebook. Do not redistribute. 5 / 33
  • 8. Intuition • Ignores aspects of a complex reality • Makes narrow/obsolete/wrong assumptions • “Fewer instructions = faster code” • “Data is faster than computation” © 2012- Facebook. Do not redistribute. 5 / 33
  • 9. Intuition • Ignores aspects of a complex reality • Makes narrow/obsolete/wrong assumptions • “Fewer instructions = faster code” • “Data is faster than computation” © 2012- Facebook. Do not redistribute. 5 / 33
  • 10. Intuition • Ignores aspects of a complex reality • Makes narrow/obsolete/wrong assumptions • “Fewer instructions = faster code” • “Data is faster than computation” • “Computation is faster than data” © 2012- Facebook. Do not redistribute. 5 / 33
  • 11. Intuition • Ignores aspects of a complex reality • Makes narrow/obsolete/wrong assumptions • “Fewer instructions = faster code” • “Data is faster than computation” • “Computation is faster than data” © 2012- Facebook. Do not redistribute. 5 / 33
  • 12. Intuition • Ignores aspects of a complex reality • Makes narrow/obsolete/wrong assumptions • “Fewer instructions = faster code” • “Data is faster than computation” • “Computation is faster than data” • The only good intuition: “I should time this.” © 2012- Facebook. Do not redistribute. 5 / 33
  • 13. Paradox Measuring gives you a leg up on experts who don’t need to measure © 2012- Facebook. Do not redistribute. 6 / 33
  • 14. Common Pitfalls • Measuring speed of debug builds • Different setup for baseline and measured ◦ Sequencing: heap allocator ◦ Warmth of cache, files, databases, DNS • Including ancillary work in measurement ◦ malloc, printf common • Mixtures: measure ta + tb , improve ta , conclude tb got improved • Optimize rare cases, pessimize others © 2012- Facebook. Do not redistribute. 7 / 33
  • 15. Optimizing Rare Cases © 2012- Facebook. Do not redistribute. 8 / 33
  • 16. More generalities • Prefer static linking and PDC • Prefer 64-bit code, 32-bit data • Prefer (32-bit) array indexing to pointers ◦ Prefer a[i++] to a[++i] • Prefer regular memory access patterns • Minimize flow, avoid data dependencies © 2012- Facebook. Do not redistribute. 9 / 33
  • 17. Storage Pecking Order • Use enum for integral constants • Use static const for other immutables ◦ Beware cache issues • Use stack for most variables • Globals: aliasing issues • thread_local slowest, use local caching ◦ 1 instruction in Windows, Linux ◦ 3-4 in OSX © 2012- Facebook. Do not redistribute. 10 / 33
  • 18. Reduce Strength © 2012- Facebook. Do not redistribute. 11 / 33
  • 19. Strength reduction • Speed hierarchy: ◦ comparisons ◦ (u)int add, subtract, bitops, shift ◦ FP add, sub (separate unit!) ◦ Indexed array access ◦ (u)int32 mul; FP mul ◦ FP division, remainder ◦ (u)int division, remainder © 2012- Facebook. Do not redistribute. 12 / 33
  • 20. Your Compiler Called I get it. a >>= 1 is the same as a /= 2. © 2012- Facebook. Do not redistribute. 13 / 33
  • 21. Integrals • Prefer 32-bit ints to all other sizes ◦ 64 bit may make some code slower ◦ 8, 16-bit computations use conversion to 32 bits and back ◦ Use small ints in arrays • Prefer unsigned to signed ◦ Except when converting to floating point • “Most numbers are small” © 2012- Facebook. Do not redistribute. 14 / 33
  • 22. Floating Point • Double precision as fast as single precision • Extended precision just a bit slower • Do not mix the three • 1-2 FP addition/subtraction units • 1-2 FP multiplication/division units • SSE accelerates throughput for certain computation kernels • ints→FPs cheap, FPs→ints expensive © 2012- Facebook. Do not redistribute. 15 / 33
  • 23. Advice Design algorithms to use minimum operation strength © 2012- Facebook. Do not redistribute. 16 / 33
  • 24. Strength reduction: Example • Digit count in base-10 representation uint32_t digits10(uint64_t v) { uint32_t result = 0; do { ++result; v /= 10; } while (v); return result; } • Uses integral division extensively ◦ (Actually: multiplication) © 2012- Facebook. Do not redistribute. 17 / 33
  • 25. Strength reduction: Example uint32_t digits10(uint64_t v) { uint32_t result = 1; for (;;) { if (v < 10) return result; if (v < 100) return result + 1; if (v < 1000) return result + 2; if (v < 10000) return result + 3; // Skip ahead by 4 orders of magnitude v /= 10000U; result += 4; } } • More comparisons and additions, fewer /= • (This is not loop unrolling!) © 2012- Facebook. Do not redistribute. 18 / 33
  • 26.
  • 27. Minimize Array Writes © 2012- Facebook. Do not redistribute. 20 / 33
  • 28. Minimize Array Writes: Why? • Disables enregistering • A write is really a read and a write • Aliasing makes things difficult • Maculates the cache • Generally just difficult to optimize © 2012- Facebook. Do not redistribute. 21 / 33
  • 29. Minimize Array Writes uint32_t u64ToAsciiClassic(uint64_t value, char* dst) { // Write backwards. auto start = dst; do { *dst++ = ’0’ + (value % 10); value /= 10; } while (value != 0); const uint32_t result = dst - start; // Reverse in place. for (dst--; dst > start; start++, dst--) { std::iter_swap(dst, start); } return result; } © 2012- Facebook. Do not redistribute. 22 / 33
  • 30. Minimize Array Writes • Gambit: make one extra pass to compute length uint32_t uint64ToAscii(uint64_t v, char *const buffer) { auto const result = digits10(v); uint32_t pos = result - 1; while (v >= 10) { auto const q = v / 10; auto const r = static_cast<uint32_t>(v % 10); buffer[pos--] = ’0’ + r; v = q; } assert(pos == 0); // Last digit is trivial to handle *buffer = static_cast<uint32_t>(v) + ’0’; return result; } © 2012- Facebook. Do not redistribute. 23 / 33
  • 31. Improvements • Fewer array writes • Regular access patterns • Fast on small numbers • Data dependencies reduced © 2012- Facebook. Do not redistribute. 24 / 33
  • 32.
  • 33. One More Pass • Reformulate digits10 as search • Convert two digits at a time © 2012- Facebook. Do not redistribute. 26 / 33
  • 34. uint32_t digits10(uint64_t v) { if (v < P01) return 1; if (v < P02) return 2; if (v < P03) return 3; if (v < P12) { if (v < P08) { if (v < P06) { if (v < P04) return 4; return 5 + (v < P05); } return 7 + (v >= P07); } if (v < P10) { return 9 + (v >= P09); } return 11 + (v >= P11); } return 12 + digits10(v / P12); } © 2012- Facebook. Do not redistribute. 27 / 33
  • 35. unsigned u64ToAsciiTable(uint64_t value, char* dst) { static const char digits[201] = "0001020304050607080910111213141516171819" "2021222324252627282930313233343536373839" "4041424344454647484950515253545556575859" "6061626364656667686970717273747576777879" "8081828384858687888990919293949596979899"; uint32_t const length = digits10(value); uint32_t next = length - 1; while (value >= 100) { auto const i = (value % 100) * 2; value /= 100; dst[next] = digits[i + 1]; dst[next - 1] = digits[i]; next -= 2; } © 2012- Facebook. Do not redistribute. 28 / 33
  • 36. // Handle last 1-2 digits if (value < 10) { dst[next] = ’0’ + uint32_t(value); } else { auto i = uint32_t(value) * 2; dst[next] = digits[i + 1]; dst[next - 1] = digits[i]; } return length; } © 2012- Facebook. Do not redistribute. 29 / 33
  • 37.
  • 38.
  • 39. Summary © 2012- Facebook. Do not redistribute. 32 / 33
  • 40. Summary • You can’t improve what you can’t measure ◦ Pro tip: You can’t measure what you don’t measure • Reduce strength • Minimize array writes © 2012- Facebook. Do not redistribute. 33 / 33