CS 33 Intro Computer Systems Doeppner
Optimization Techniques in C
Fall, 2014
1 Code Motion
Code motion involves identifying computations that occur within loops but need only be performed
once per execution of the loop. A good example of such an operation is a function call in a
for or while loop header: assuming that the function's return value is constant for the duration
of the loop, calling it on every iteration unnecessarily repeats the same work.
For example, the following code shifts each character of a string.
void shift_char(char *str) {
    int i;
    for (i = 0; i < strlen(str); i++) {
        str[i]++;
    }
}
However, there is no reason to calculate the length of the string on every iteration of the loop.
Thus, we can optimize the function by moving the function call strlen out of the loop:
void shift_char(char *str) {
    int i;
    int len = strlen(str);
    for (i = 0; i < len; i++) {
        str[i]++;
    }
}
The new version of the function computes the string's length only once, saving time on every
subsequent iteration.
2 Loop Unrolling
Another common optimization technique is called loop unrolling — this entails repeating the code
that will be executed multiple times in a loop, so that fewer loop iterations need to be made. The
concept is simple if counterintuitive, since programmers are used to using loops to avoid exactly this
sort of repetition (which is why this type of optimization is usually done by the compiler, rather
than by humans).
Here is a simple example. In the unoptimized program, a while loop is used to call a certain
function some finite, but large, number of times. Loops like this one are very common, and you
have surely written numerous similar code snippets over the course of your CS career:
int i = 0;
while (i < num) {
    a_certain_function(i);
    i++;
}
There is a problem here, however: loops incur a certain amount of overhead, especially when
each iteration does only a small amount of work (such as calling a single function). The conditional
branch at the top of the loop, and the unconditional return jump at the bottom, are both operations
that take a non-trivial number of cycles to complete, and thus slow the program down, albeit by a
small amount. Thus, assuming that num is a multiple of 4, the following optimized code might be
preferable to the original:
int i = 0;
while (i < num) {
    a_certain_function(i);
    a_certain_function(i + 1);
    a_certain_function(i + 2);
    a_certain_function(i + 3);
    i += 4;
}
In this version, although the actual code visible to the user (and the binary) is longer, the program
will take less time to run, because it need only loop num/4 times to compute the same results as
the original version.
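The assumption that num is a multiple of 4 can be removed with a short cleanup loop. Here is a minimal, self-contained sketch of that idea; `a_certain_function` is a stand-in that simply accumulates its argument so the loop's effect can be checked:

```c
#include <assert.h>

/* Stand-in for the work done each iteration; it accumulates its
 * argument so the loop's effect can be verified. */
static long total = 0;
static void a_certain_function(int i) { total += i; }

/* Loop unrolled by a factor of 4, with a cleanup loop for the
 * leftover iterations, so num need not be a multiple of 4. */
void run_unrolled(int num) {
    int i = 0;
    for (; i + 3 < num; i += 4) {   /* four calls per iteration */
        a_certain_function(i);
        a_certain_function(i + 1);
        a_certain_function(i + 2);
        a_certain_function(i + 3);
    }
    for (; i < num; i++)            /* at most 3 leftover calls */
        a_certain_function(i);
}
```

The cleanup loop runs at most three times, so the unrolled body still does almost all of the work.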
Loop unrolling does have one major drawback: it makes programs significantly longer (in terms
of lines of code and size of the compiled binary) than they would be otherwise. As in many cases
in computer science, the decision of whether to unroll or not to unroll is a tradeoff between time
and program size. For some applications, the added binary length might be worth the decreased
runtime; however, there are also applications for which a smaller binary is more important than
a shorter runtime (for example, if one were coding for a machine with very little RAM, or were
writing a program that was already very large).
3 Inlining
Inlining is the process by which the contents of a function are “inlined” — basically, copied and
pasted — instead of a traditional call to that function. This form of optimization avoids the
overhead of function calls, by eliminating the need to jump, create a new stack frame, and reverse
this process at the end of the function.
For example, consider the following piece of code:
#include <stdio.h>

int get_random_number(void) {
    return 4; // chosen by fair die roll
              // guaranteed to be random
}

int main(int argc, char **argv) {
    int a = get_random_number();
    printf("%d\n", a);
    return 0;
}
Making a whole function call just to retrieve the number 4 seems a bit inefficient, and gcc would
agree with you on that. To optimize this program, the code inside of get_random_number() (which
happens to just return the integer 4) would be substituted for the function call in the body of main().
Thus, after optimization, the function would look like this:
int main(int argc, char **argv) {
    int a = 4;
    printf("%d\n", a);
    return 0;
}
A smart compiler might even eliminate the temporary variable a, although for the sake of this
example let us assume that it does not. The resultant program makes no function calls, yet it
produces the same result as the original version. It will run faster than the original, given
that it doesn't have to jump, create a new stack frame, and undo all of that work just to
retrieve the number 4.
Once inlining is brought into the picture, the distinction between “functions” and “macros” may
seem a bit fuzzier. However, there remain important semantic differences between the two. Most
importantly, a function always defines its own scope — that is, local variables defined within a
function may only be accessed from within that function (in C, this is true of any block of code
surrounded by curly braces). When inlining more complex functions with lots of local variables,
things get much more complex. We won’t get into the specifics here (this is a systems course, not
a semantics course), but let it suffice to say that the process is a bit more complicated than simply
copying and pasting the contents of one function into another.
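Modern compilers such as gcc perform inlining automatically at higher optimization levels, but C also lets the programmer suggest it with the inline keyword. A minimal sketch follows; the helper names are hypothetical, and inline is only a hint that the compiler is free to ignore:

```c
#include <assert.h>

/* A small helper marked static inline: the compiler may paste its body
 * at each call site instead of emitting a call instruction. */
static inline int square(int x) {
    return x * x;
}

int sum_of_squares(int a, int b) {
    /* With optimization on, each square() call is typically replaced by
     * the multiplication itself, with no call overhead. */
    return square(a) + square(b);
}
```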
4 Writing Cache-Friendly Code
Another optimization commonly made to programs is cache-friendly code — that is, code whose
organization and design utilizes the machine’s cache in the most efficient way possible.
Caches, as you have seen in lecture, store lines containing bytes that are located consecutively in
main memory. This design is based on the principle of spatial locality — that is, those memory
regions that are physically closer together are more likely to be accessed within a short time of one
another than those spread more thinly in RAM. In order to write cache-friendly code, a programmer
must respect this principle, by writing programs that primarily access closely-grouped memory
locations sequentially or simultaneously.
For example, the following code will visit every entry in a two-dimensional array, and add 1 to each:
for (j = 0; j < NUM_COLS; j++) {
    for (i = 0; i < NUM_ROWS; i++) {
        array[i][j] += 1;
    }
}
However, there is a problem here: two-dimensional arrays in C are actually stored as one-dimensional
arrays, with each row intact and the rows listed one after another, such that the underlying index of
an element can be computed as row * NUM_COLS + col (in technical terminology, this is called row-major
order). When we iterate through the array in the above example, however, we iterate in column-major
order¹. Unless our cache lines are very large, or the array is very small, this poses a problem,
because consecutive memory accesses will be separated from one another by NUM_COLS elements. Thus,
if a row (NUM_COLS elements) is larger than a cache line, and NUM_ROWS is greater than the number of
lines in the cache, we run into the following problem:
On each of the first n iterations (where n is the number of lines in the cache), the program suffers
a cache miss, since the next byte needed is not in any of the lines currently in the cache. This is
not a huge problem on its own: we started with a cold cache, and some number of misses was bound
to be required to warm it up. However, on the next iteration we again suffer a miss, and since the
cache is now full, one of the current lines must be evicted to make room for the new one. Since
each read requests an address on a different line, and there are not enough lines to hold the
entire array, a miss occurs on every iteration, slowing the program down considerably. This is
unacceptable: if we're to have a miss each time we access memory, we might as well not have a
cache at all. Luckily, there is a simple fix: reverse the order in which the rows and columns of
the array are traversed:
for (i = 0; i < NUM_ROWS; i++) {
    for (j = 0; j < NUM_COLS; j++) {
        array[i][j] += 1;
    }
}
Now we are iterating through the columns in the inner loop and the rows in the outer loop. Turning
our attention back to the cache, we see a different pattern: the first memory access is a miss,
since the cache is cold. However, the next several accesses are hits, since the requisite values
were loaded into the cache along with the first, as part of its cache line. Only when we cross
into the next line do we encounter another miss. This pattern continues, and misses become far
less frequent than before.
5 Cache Blocking
What happens, though, when each iteration accesses two elements that are not spatially related? For
example, computing the product of two matrices accesses the rows of the first matrix and the columns
of the second. An example of this function is as follows:
¹ Which is just like row-major order, except that the columns are intact and listed one after another.
void matrixproduct(double *a, double *b, double *c, int n) {
    int i, j, k;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < n; k++) {
                c[i*n + j] += a[i*n + k] * b[k*n + j];
            }
        }
    }
}
This will generate a lot of cache misses, as in the previous example. If we assume a cache block
holds 8 doubles, each iteration of the middle loop (computing one element of c) incurs about
n/8 + n cache misses. The n/8 term comes from traversing a row of the first matrix: each miss
loads one block of 8 consecutive doubles, so only every eighth access misses. The n term comes
from traversing a column of the second matrix: consecutive accesses are n doubles apart, so each
lands on a different cache line, and since the cache cannot hold every row of the matrix, every
access misses. With n² iterations, there are on the order of n³ cache misses! It would be nice if
we could take advantage of the rows of the second matrix despite traversing its columns.
Luckily, the matrix product algorithm lends itself to a blocking solution. That is, instead of
working with whole rows of the first matrix and whole columns of the second, we work with a block
of the first matrix and a block of the second at a time. Since the additions that accumulate into
each entry of c are commutative and associative, the order in which the partial products are
summed doesn't matter. An example of a blocked matrix product is as follows:
void matrixproduct(double *a, double *b, double *c, int n) {
    int i, j, k, i2, j2, k2;
    for (i = 0; i < n; i += B) {
        for (j = 0; j < n; j += B) {
            for (k = 0; k < n; k += B) {
                // Compute the product of one BxB block pair
                for (i2 = i; i2 < i + B; i2++) {
                    for (j2 = j; j2 < j + B; j2++) {
                        for (k2 = k; k2 < k + B; k2++) {
                            c[i2*n + j2] += a[i2*n + k2] * b[k2*n + j2];
                        }
                    }
                }
            }
        }
    }
}
We will make the same cache-line assumptions (64-byte lines, so each line holds 8 doubles) and
assume that around 3 blocks fit into the cache. Since each block is B×B, computing the product of
two blocks requires room for 3B² doubles in the cache, to hold the multiplicand block, the
multiplier block, and the product block; we therefore need 3B² < C, where C is the cache capacity
measured in doubles. The first access to each cache-line-aligned line of a block results in a miss
but brings in 8 doubles, so 1/8 of the accesses to matrix elements (each of which is a double)
result in a miss, or B²/8 misses per block. In each iteration of the blocked loops, 2n/B blocks
are examined (n/B for the multiplicand and n/B for the multiplier). Since each block incurs B²/8
misses, there are (2n/B)(B²/8) = nB/4 misses per iteration.
Another improvement is that we now have fewer iterations, since we are dealing with a whole block
of data at once. There are (n/B)² such iterations, leading to a total of (n/B)²(nB/4) = n³/(4B)
cache misses. By using the largest block size that still fits in the cache, we get far fewer
cache misses through blocking.
Blocked solutions are not always easy to see or create, but when applicable they can make code much
faster. A thorough understanding of the algorithm involved is needed to make the most efficient code.
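Since blocking only reorders the additions, the blocked product must compute exactly the same values as the naive one. The following self-contained sketch checks this; the function names and the block size (here a hypothetical BLK = 4, with n assumed to be a multiple of BLK) are chosen for illustration:

```c
#include <assert.h>

#define BLK 4  /* block size; 3*BLK*BLK doubles must fit in the cache */

/* Naive triple loop, for comparison. */
void matmul_naive(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}

/* Blocked version: identical arithmetic, reordered so each BLK x BLK
 * block of a, b, and c is reused while it is resident in the cache.
 * Assumes n is a multiple of BLK for simplicity. */
void matmul_blocked(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n; i += BLK)
        for (int j = 0; j < n; j += BLK)
            for (int k = 0; k < n; k += BLK)
                for (int i2 = i; i2 < i + BLK; i2++)
                    for (int j2 = j; j2 < j + BLK; j2++)
                        for (int k2 = k; k2 < k + BLK; k2++)
                            c[i2*n + j2] += a[i2*n + k2] * b[k2*n + j2];
}
```

On small integer-valued inputs the two versions agree exactly, since only the summation order differs and the partial products are identical.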
6 Writing Pipeline-Friendly Code
Finally, optimized programs must take into account how the machine’s pipeline will affect their
performance. As you have seen in lecture, the nature of a pipeline makes it such that, after some
instructions (such as conditional branches), it is impossible to predict with perfect accuracy which
instruction will be executed next. However, since knowing the following instruction, even some
of the time, helps with efficiency, processor designers make every effort to anticipate the next
instruction as often as possible in these cases.
Branch prediction, however, only gets it right so often, so it’s up to the programmer to make
sure that it is easy for the processor to predict branches. There are a number of different branch
prediction schemes, which are detailed in your book and in lecture.
However, as was mentioned in the section on loop unrolling, the best solution is to avoid branching
as much as possible, since mispredicted branches are fairly expensive, and it is difficult to know
the branch-prediction scheme of the machine on which your code will run. In many processors the
branch predictor is good enough that there is very little the programmer can do to help; thus,
while you shouldn't go to too great lengths to avoid branches, you should be cognizant of where
they appear in your code and ask yourself whether they are necessary.
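One classic way to remove a branch entirely is to turn the condition into arithmetic. The sketch below is illustrative (the function names are hypothetical): both functions sum the positive elements of an array, but the second contains no data-dependent branch in its loop body, so there is nothing for the predictor to mispredict. Whether this actually helps depends on the data and the processor.

```c
#include <assert.h>

/* Branching version: on random data, the if is mispredicted often. */
long sum_if_positive_branchy(const int *a, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] > 0)
            sum += a[i];
    }
    return sum;
}

/* Branchless version: a mask turns the condition into arithmetic. */
long sum_if_positive_branchless(const int *a, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        int mask = -(a[i] > 0);   /* all ones if positive, else zero */
        sum += a[i] & mask;
    }
    return sum;
}
```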
UNIT - IV - Air Compressors and its Performance
 
NFPA 5000 2024 standard .
NFPA 5000 2024 standard                                  .NFPA 5000 2024 standard                                  .
NFPA 5000 2024 standard .
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
Vivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design SpainVivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design Spain
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 

C optimization notes

CS 33 Intro Computer Systems Doeppner
Optimization Techniques in C
Fall, 2014

1 Code Motion

Code motion involves identifying bits of code that occur within loops, but need only be executed once during that particular loop. A good example of such an operation is including a function call in a for or while loop header — assuming that the header function’s return value will be constant for the duration of the loop, the header function will unnecessarily execute during each iteration of the loop.

For example, the following code shifts each character of a string.

    void shift_char(char *str){
        int i;
        for(i=0;i<strlen(str);i++){
            str[i]++;
        }
    }

However, there is no reason to calculate the length of the string on every iteration of the loop. Thus, we can optimize the function by moving the function call strlen out of the loop:

    void shift_char(char *str){
        int i;
        int len=strlen(str);
        for(i=0;i<len;i++){
            str[i]++;
        }
    }

The new version of the function only needs to compute the string’s length once, saving time in its execution.

2 Loop Unrolling

Another common optimization technique is called loop unrolling — this entails repeating the code that will be executed multiple times in a loop, so that fewer loop iterations need to be made. The concept is simple if counterintuitive, since programmers are used to using loops to avoid exactly this sort of repetition (which is why this type of optimization is usually done by the compiler, rather than by humans).

Here is a simple example. In the unoptimized program, a while loop is used to call a certain function some finite, but large, number of times. Loops like this one are very common, and you have surely written numerous similar code snippets over the course of your CS career:
    int i = 0;
    while (i < num) {
        a_certain_function(i);
        i++;
    }

There is a problem here, however: loops incur a certain amount of overhead, especially when each iteration does only a small amount of work (such as calling a single function). The conditional branch at the top of the loop, and the unconditional jump back at the bottom, are both operations that take a non-trivial number of cycles to complete, and thus slow the program down, albeit by a small amount. Thus, assuming that num is a multiple of 4, the following optimized code might be preferable to the original:

    int i = 0;
    while (i < num) {
        a_certain_function(i);
        a_certain_function(i+1);
        a_certain_function(i+2);
        a_certain_function(i+3);
        i += 4;
    }

In this version, although the actual code visible to the user (and the binary) is longer, the program will take less time to run, because it need only loop num/4 times to compute the same results as the original version.

Loop unrolling does have one major drawback: it makes programs significantly longer (in terms of lines of code and size of the compiled binary) than they would be otherwise. As in many cases in computer science, the decision of whether to unroll or not to unroll is a tradeoff between time and program size. For some applications, the added binary length might be worth the decreased runtime; however, there are also applications for which a smaller binary is more important than a shorter runtime (for example, if one were coding for a machine with very little RAM, or were writing a program that was already very large).

3 Inlining

Inlining is the process by which the contents of a function are “inlined” — basically, copied and pasted — in place of a traditional call to that function. This form of optimization avoids the overhead of function calls by eliminating the need to jump, create a new stack frame, and reverse this process at the end of the function.
For example, consider the following piece of code:
    int get_random_number(void) {
        return 4; // chosen by fair die roll
                  // guaranteed to be random
    }

    int main(int argc, char **argv) {
        int a = get_random_number();
        printf("%d\n", a);
        return 0;
    }

Making a whole function call just to retrieve the number 4 seems a bit inefficient, and gcc would agree with you on that. To optimize this function, the code inside of get_random_number() (which happens to just be the integer 4) would be substituted for the function call in the body of main(). Thus, after optimization, the function would look like this:

    int main(int argc, char **argv) {
        int a = 4;
        printf("%d\n", a);
        return 0;
    }

A smart compiler might even eliminate the temporary variable a, although for the sake of this example let us assume that it does not. The resultant program does not make any function calls; however, it produces the same result as the original version. It will run much faster than the original, given that it doesn’t have to jump, create a new stack frame, and undo all of that time-consuming work just to get the number 4.

Once inlining is brought into the picture, the distinction between “functions” and “macros” may seem a bit fuzzier. However, there remain important semantic differences between the two. Most importantly, a function always defines its own scope — that is, local variables defined within a function may only be accessed from within that function (in C, this is true of any block of code surrounded by curly braces). When inlining more complex functions with lots of local variables, things get much more complex. We won’t get into the specifics here (this is a systems course, not a semantics course), but let it suffice to say that the process is a bit more complicated than simply copying and pasting the contents of one function into another.
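In C, a programmer can request this optimization explicitly. The sketch below is our own example (not from the handout): marking the function static inline hints to the compiler that it may substitute the body at each call site. Note that inline is only a hint, and at optimization levels like -O2, gcc will typically inline a function this small even without it.

```c
/* "static inline" asks the compiler to substitute the function body
 * at each call site. It is only a hint: the compiler is free to
 * ignore it, though at -O2 it typically inlines a function this
 * small anyway. */
static inline int get_random_number(void) {
    return 4;  /* chosen by fair die roll; guaranteed to be random */
}

/* After inlining, this function compiles as if it were "return 4;". */
int caller(void) {
    return get_random_number();
}
```

Compiling with gcc -O2 -S and inspecting the generated assembly should show that caller contains no call instruction at all.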
4 Writing Cache-Friendly Code

Another optimization commonly made to programs is writing cache-friendly code — that is, code whose organization and design uses the machine’s cache in the most efficient way possible. Caches, as you have seen in lecture, store lines containing bytes that are located consecutively in main memory. This design is based on the principle of spatial locality — that is, memory regions that are physically closer together are more likely to be accessed within a short time of one another than those spread more thinly in RAM. In order to write cache-friendly code, a programmer must respect this principle by writing programs that primarily access closely-grouped memory locations sequentially or simultaneously.
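As a concrete illustration of this principle (our own sketch, not from the course materials), consider two ways of summing the same buffer. Both compute the same total, but the stride-1 loop reuses each cache line for several consecutive elements, while the strided loop — with a stride chosen to match an assumed 64-byte line — touches a fresh line on every access.

```c
#include <stddef.h>

/* Stride-1 traversal: consecutive accesses fall on the same cache
 * line, so each line fetched from memory is reused for the next
 * several elements before it can be evicted. */
long sum_sequential(const int *buf, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += buf[i];
    return total;
}

/* Strided traversal of the same data: with a stride of 16 ints
 * (64 bytes, assuming 4-byte ints and 64-byte lines), each access
 * lands on a different cache line, so a buffer larger than the
 * cache gets essentially no reuse at all. */
long sum_strided(const int *buf, size_t n, size_t stride) {
    long total = 0;
    for (size_t start = 0; start < stride; start++)
        for (size_t i = start; i < n; i += stride)
            total += buf[i];
    return total;
}
```

On a large buffer, timing the two functions should show the sequential version running noticeably faster, even though they execute the same number of additions.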
For example, the following code will visit every entry in a two-dimensional array, and add 1 to each:

    for (j = 0; j < NUM_COLS; j++) {
        for (i = 0; i < NUM_ROWS; i++) {
            array[i][j] += 1;
        }
    }

However, there is a problem here: two-dimensional arrays in C are actually stored as one-dimensional arrays, with rows intact and listed one after the other, such that the underlying index of an element can be computed as row * rowlen + col (in technical terminology, this is called row-major order). When we iterate through the array in the above example, however, we iterate in column-major order¹. Unless our cache lines are very large, or the array is very small, this poses a problem, because sequential memory accesses will each be separated from one another by NUM_COLS elements. Thus, if NUM_COLS is greater than the length of a cache line, and NUM_ROWS is greater than the number of lines in the cache, we run into the following problem: on each of the first n iterations (where n is the number of lines in the cache), the program suffers a cache miss, since the next byte needed is not in any of the cache lines currently present. This is not a huge problem in itself, because we started with a cold cache, and some number of misses were bound to be required in order to warm it up. However, on the next iteration, we again suffer a miss, and since the cache is now full, one of the current cache lines must be evicted to make room for the new line. Since each read requests an address that is on a different line, and there are not enough lines to store the entire array in the cache, a miss occurs on every iteration, slowing the program down considerably. This is unacceptable — if we’re to have a miss each time we access memory, we might as well not have the cache at all.
Luckily, there is a simple fix: reverse the order in which the rows and columns of the array are traversed:

    for (i = 0; i < NUM_ROWS; i++) {
        for (j = 0; j < NUM_COLS; j++) {
            array[i][j] += 1;
        }
    }

Now we are iterating through the columns in the inner loop, and the rows in the outer. Turning our attention back to the cache, we see a different pattern: the first memory access is a miss, since the cache is cold. However, the following accesses — one for each remaining element on that cache line — are hits, since those values were loaded into the cache along with the first, as part of the same line. Only then do we encounter another miss. This pattern continues, and misses become much less frequent than before.

¹ Which is just like row-major order, except that the columns are intact and listed one after another.

5 Cache Blocking

What happens, though, when each iteration accesses two non-spatially-related elements? For example, the product of two matrices accesses the rows of the first matrix and the columns of the second matrix. An example of this function is as follows:
    void matrixproduct(double *a, double *b, double *c, int n){
        int i, j, k;
        for(i=0; i<n; i++){
            for(j=0; j<n; j++){
                for(k=0; k<n; k++){
                    c[i*n+j] += a[i*n+k]*b[k*n+j];
                }
            }
        }
    }

This will generate a lot of cache misses, as in the previous example. If we assume a cache block to be 8 doubles in size, then each iteration of the inner loop nest incurs n/8 + n cache misses. This follows from the fact that an entire column of the second matrix is traversed. Each cache miss loads a single block into the cache, the size of which is 8 doubles; thus the row of the first matrix accounts for n/8 misses. For the second matrix, however, every element of the column resides on a different cache line, so the column accounts for n misses. With n² iterations, there are on the order of n³ cache misses!

It would be nice if we could take advantage of the rows of the second matrix despite traversing its columns. Luckily, the matrix product algorithm lends itself to a blocking solution. That is, instead of dealing with rows of the first matrix and columns of the second matrix, we deal with a block of the first matrix and a block of the second matrix. Since the additions are commutative, the order in which the partial products are accumulated doesn’t matter. An example of a blocked matrix product, where B is a constant block size, is as follows:

    void matrixproduct(double *a, double *b, double *c, int n){
        int i, j, k, i2, j2, k2;
        for(i=0; i<n; i+=B){
            for(j=0; j<n; j+=B){
                for(k=0; k<n; k+=B){
                    // Compute the product of the current pair of blocks
                    for(i2=i; i2<i+B; i2++){
                        for(j2=j; j2<j+B; j2++){
                            for(k2=k; k2<k+B; k2++){
                                c[i2*n+j2] += a[i2*n+k2]*b[k2*n+j2];
                            }
                        }
                    }
                }
            }
        }
    }

We will make the same cache size assumptions and assume that around 3 blocks fit into the cache.
Since each block is of size B×B, if we are computing the product of two blocks we need 3B² of space in the cache to hold the multiplicand, the multiplier, and the product, with 3B² < C. (This assumes the units of B and C are bytes.) Since a cache line is 64 bytes, each line holds 8 doubles. Thus the first access to a cache-line-aligned block results in a miss, but brings in 8 doubles. Hence 1/8 of the accesses to matrix elements (each of which is a double) result in a miss, or B²/8 misses per block. In each iteration, 2n/B blocks are examined (n/B for the multiplicand and n/B for the multiplier). Since each block results in B²/8 misses, there are (2n/B)(B²/8) = nB/4 misses per iteration. Another improvement is that we now have fewer iterations, since we are dealing with a whole block of data at once. We have (n/B)² iterations, leading to a total of (n/B)²(nB/4) = n³/(4B) cache misses. By using the largest block size that fits in our cache, we get far fewer cache misses by blocking. Blocked solutions are not always easy to see or create, but when used they can make code much faster. Complete knowledge of the algorithm used is needed in order to write the most efficient code.

6 Writing Pipeline-Friendly Code

Finally, optimized programs must take into account how the machine’s pipeline will affect their performance. As you have seen in lecture, the nature of a pipeline makes it such that, after some instructions (such as conditional branches), it is impossible to predict with perfect accuracy which instruction will be executed next. However, since knowing the following instruction, even some of the time, helps with efficiency, processor designers make every effort to anticipate the next instruction as often as possible in these cases. Branch prediction, however, only gets it right so often, so it’s up to the programmer to make sure that branches are easy for the processor to predict. There are a number of different branch prediction schemes, which are detailed in your book and in lecture.
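To see why predictability matters, consider a loop whose branch outcome depends on the data it reads (this example is ours, not from the handout):

```c
/* The branch in this loop depends on the data. Over a sorted array
 * its outcome flips only once during the whole scan, so a dynamic
 * branch predictor is right on nearly every iteration; over
 * shuffled data the same instructions mispredict roughly half the
 * time. */
int count_at_least(const int *a, int n, int threshold) {
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] >= threshold)  /* data-dependent branch */
            count++;
    }
    return count;
}
```

Running this over a large sorted array and then over the same array shuffled executes identical instructions on identical data, yet the sorted run is typically measurably faster, purely because of branch prediction.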
However, as was mentioned in the section on loop unrolling, the best solution is to avoid branching as much as possible, since mispredicted branches are fairly expensive, and it is difficult to know the branch prediction scheme of the machine on which your code will be run. In many processors, the branch predictor is good enough that there is very little the programmer can do to help — thus, while you shouldn’t go to too great lengths to avoid branches, you should be cognizant of where they appear in your code and ask yourself whether they are necessary.
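As a small illustration of branch avoidance (our own sketch): a data-dependent if inside a loop can often be replaced by arithmetic on the comparison result, which in C always evaluates to 0 or 1.

```c
/* Branchless counting: the comparison itself yields 0 or 1, so it
 * can be accumulated directly and the conditional jump disappears
 * from the loop body. */
int count_at_least_branchless(const int *a, int n, int threshold) {
    int count = 0;
    for (int i = 0; i < n; i++) {
        count += (a[i] >= threshold);  /* 0 or 1; no branch taken */
    }
    return count;
}
```

Whether this wins in practice depends on the machine and the data; modern compilers often emit a conditional-move instruction for the branchy version anyway, so measure before committing to the less readable form.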