Writing Efficient Code Feb 08


Insight: Understand the Machine to Write Efficient Code
LINUX For You, February 2008 | www.openITis.com

How many programmers can actually write assembly programs? With the rising popularity of high-level languages (like Java, VB.Net, etc.), there is rarely any need for programmers to learn assembly or low-level programming. However, there are domains where writing efficient code is very important, for example, game programming and scientific computing.

When can we write highly efficient code? When we understand how the underlying machine works and make the best use of that knowledge. One well-known way to write highly efficient code is to write it in assembly. But this has many disadvantages: assembly programs cannot be ported easily, the code is difficult to maintain, and so on. So we need to look for alternatives. One is to write in a low-level programming language like C, whose code is often comparable in efficiency to the equivalent code written in assembly; for this reason, C is often referred to as a 'high-level assembler'.

In this article, we'll look at various programming constructs from the perspective of efficiency. We'll consider a general machine architecture for illustration, and use the x86 architecture for specific examples. A word of caution before we proceed: the techniques and issues covered here are not for general-purpose programming.
Basic types
The machine, in general, understands only three kinds of values: addresses, integers and floating-point values. For representation and manipulation, here is the correspondence between the types the machine understands and the types C supports. Addresses correspond to the pointer construct; integers (both signed and unsigned) correspond to short, int, long, long long, char (yes, a char is represented as an int internally!), etc.; floating-point types correspond to float, double, long double, etc.

The most efficient data type that a processor handles is a 'word', which corresponds to the 'int' type in C. For floating-point types, all computation is typically done in a larger floating-point type. For example, on x86 machines, all floating-point computation is done in 'extended precision', which is 80 bits in size and usually corresponds to 'long double' in C; if floating-point expressions are used in the code, they are internally converted to extended precision by the processor and the results are converted back to float (which occupies 32 bits). The processor does floating-point computations in a separate 'co-processor' unit.

Address computation (such as array index access) is done using pointers in C. Pointers directly correspond to memory access operations in the underlying machine (indexed addressing, in this case).

The 'int' type is the most efficient for computations. Unsigned types, and operations on them, are as efficient as signed types. Floating-point types and operations are slow compared to integral types. For example, in an imaginary processor, if integer division takes four cycles, a floating-point division operation might take 50 to 100 cycles. Memory access operations are very slow: if the desired memory location is not in cache, it might take hundreds of cycles to fetch the data from main memory.

Operators
C supports a rich set of operators. A few operators are directly supported by the processor, and a few are simulated in software. For integral types, bit-manipulation operators are faster than other operators such as arithmetic, logical or relational operators. So one way to write efficient code is to use bitwise operations instead of other, slower operations. Here is a well-known example: using the shift operators ('<<' and '>>') is more efficient than multiplying or dividing an integer value by 2.

We'll look at a different example for illustration here. A typical code segment for toggling a character's case uses relational operators, as in:

    // precondition: the char ch provided
    // is in range ['a'-'z'] or ['A'-'Z']
    char toggle_ascii_char_case(char ch)
    {
        if( (ch >= 'a') && (ch <= 'z') )      // lower case
            ch = ch - 0x20;
        else if( (ch >= 'A') && (ch <= 'Z') ) // upper case
            ch = ch + 0x20;
        return ch;
    }

The code works on the following assumption: the given char ch is within the range [a-z] or [A-Z]. If the char is in [A-Z], it returns the corresponding char in [a-z], and vice versa; that is, it toggles the case of the character. The value 0x20 is added or subtracted based on the fact that corresponding upper- and lower-case characters are separated by the hex value 0x20 in the ASCII table.

But this function is slow. If this is a library function used for toggling the case of characters in a string, it is not a good implementation. The code can be improved as follows. The function precondition says that the ch passed to the function is in the given range ['a'-'z'] or ['A'-'Z']; following C tradition, we need not check that the given char is in fact in this range, so the comparison operators are not required. Also, since we are either adding or subtracting 0x20, we can replace the arithmetic with a bitwise operation that toggles the relevant bit using the exclusive-or operator. With this, the code becomes efficient and simple:

    // precondition: the char ch provided
    // is in range ['a'-'z'] or ['A'-'Z']
    char toggle_ascii_char_case(char ch)
    {
        return (ch ^= 0x20);
    }

This example is just for illustration purposes. Using bitwise operators obscures the code, but it usually improves its efficiency significantly.
Control flow
C has various conditional and looping constructs. A C compiler transforms such constructs into branching (also known as 'jump') instructions. So goto is the construct that maps most directly onto the machine: it is possible to take any C program and create an equivalent program by replacing all conditions and loops with just goto statements. Though all of these constructs ultimately become branching instructions, there are subtle differences between them when it comes to efficiency.

Which is more efficient: nested if conditions or a switch statement? In general, a switch is more efficient than nested if statements. If both are implemented using branching, why should that be? Recall that in a switch statement all the case labels are constants, so a compiler can transform a switch statement into a range check or a look-up (jump) table, which is more efficient than a long list of compare-and-jump instructions.
Note that executing jump instructions is not costly in itself, but unpredictable jumps can result in considerably slower execution. For example, frequent jumps can cause the processor pipeline to be flushed. Similarly, processors typically look ahead in the instruction stream, pre-fetch the memory they will need and put it in cache; unpredictable jumps can therefore result in memory faults (cache misses), wasting hundreds of cycles while the processor waits for the memory value to become available before it can continue execution. In general, a program with fewer branches is faster than one with a large number of branches.

Memory access
As said earlier, memory access is a costly operation. Let us take a specific example to illustrate this. Typically, it takes the same time to access global data as local (stack-allocated) data. However, it is preferable to use local data instead of global data because of the well-known 'principle of locality'. When a memory location is accessed, the processor doesn't fetch the value at just that location; it fetches many adjacent values as well, since the program is likely to access variables located near that memory location. Fetching a block of data and putting it in cache is not time consuming in itself, but a memory fault can leave the processor waiting hundreds of cycles for the memory access to complete. If there are a large number of global variables and their accesses are spread throughout the program, program execution becomes considerably slower.

Let us consider another well-known example to illustrate this important 'principle of locality'. Of the following two loops, which is more efficient?

    for(i = 0; i < 50; i++)
        for(j = 0; j < 50; j++)
            for(k = 0; k < 50; k++)
                printf("%d ", a[k][j][i]);

    for(i = 0; i < 50; i++)
        for(j = 0; j < 50; j++)
            for(k = 0; k < 50; k++)
                printf("%d ", a[i][j][k]);

C implements arrays in row-major order, that is, the same way they are organised in the hardware. The second loop is more efficient because it accesses memory locations sequentially, in row-major order: the processor fetches memory blocks into cache, and since the access pattern is sequential, those blocks are fully used. In the first loop, however, the memory access is not sequential, so there may be many memory faults, and it will be considerably slower than the second loop. From these examples we learn that memory faults are costly, and we need to minimise them to write efficient code.

Compound types
C supports compound types (structs, unions and bit-fields) that are implemented in terms of other primitive or compound types. The hardware does not understand any compound types; all processing is done on primitive types only. There are many aspects of using compound types, such as padding and alignment, that can affect performance. Here, we'll look at an example of bit-fields to understand how they are supported by the hardware.

In C, we can manipulate and access bits using bit-fields. It is an error to attempt taking the address of a bit-field member. Why? The granularity of addressing in modern computers is the byte, and it is not possible to take the address of specific bits within a byte. Though bit-fields are space-efficient, they are not time-efficient: since individual bits cannot be accessed directly, the compiler emits code to access whole 'word's and then does bit-manipulation to extract individual bit-field member values.

The following bit-field struct represents a time of day in HH:MM:SS format. The hour is in the range 0-23, and the minutes and seconds are in the range 0-59:

    struct time {
        unsigned int hour   : 5;
        unsigned int minute : 6;
        unsigned int second : 6;
    } tm1, tm2;

To access tm1.minute, the compiler has to generate code that loads the word (4 bytes) and then does bit-manipulation to extract only the 5th to 10th bits of that word, which is slow. So avoid using bit-fields extensively if performance is important for your software. A better option in this case is to use a struct (without bit-fields) with a byte each for the hour, minute and second.
In this article, we explored some fundamental issues in understanding how various programming constructs affect the efficiency of programs. There are many other issues, such as unaligned memory access and the cost of I/O operations, that are not covered here; this article is just a starting point for understanding such problems. If you are interested, you can read books on assembly language, computer architecture and compiler optimisation to get a better understanding of the issues related to writing efficient programs.

By S G Ganesh, a research engineer at Siemens (Corporate Technology), Bangalore. His latest book is '60 Tips on Object Oriented Programming', published by Tata McGraw-Hill in December 2007. You can reach him at sgganesh@gmail.com.