User Tools

Site Tools


spo600:compiler_optimizations

Compiler Optimizations

Compiler optimizations are alterations made to code by a compiler to achieve the same result as the original input code but with improved performance. This can mean reduced code size, reduced execution size, or improved execution speed; often, these goals are in conflict and optimizing for one (such as speed) will come at the cost of another (such as code size).

Why It's Important to Understand Compiler Optimizations

As programmers, understanding the optimizations performed by our tools enables us to trust the tools to transform code that is readable and logical to our way of thinking into code that performs well. It also enables us to focus our manual optimization efforts where they will have the most impact, such as on algorithm selection and data representation/variable type selection.

GCC Optimization Options

GCC offers a large number of optimization options, most of which can be controlled in combination using the -O flag. This option accepts a value from 0 (no optimization) through 3 (highest optimization), s (optimize for size), fast (optimize for speed only), or g (optimize for debugging experience - avoid optimizations which convolute debugging).

The -O actually selects a set of features. These features may also be individually selected using the -f flag, followed by the feature name; for example, -funroll_loops enables the unrolling of loops (note: other features also affect loop unrolling). To disable a feature, use the -f flag followed by no- and the feature name; for example, -fno-unroll-loops turns off loop unrolling.

To see the individual features enabled by a particular -O flag, or combination of -O and feature flags, use the -Q –help=optimizers flags, which will query the optimization feature list. For example, to see all of the optimizations enabled/disabled at -O2, use this command:

gcc -O2 -Q –help=optimizers

The GCC documentation – both the man page and even more so the online manuals – have good information about the 200+ optimization features of the GCC compilers.

LLVM Optimization Options

LLVM can perform an extensive set of analysis and transform operations (optimizations). See the llvm documentation for details.

Examples of Common Optimizations

Here are some common C optimizations. For each optimization, an input code snippet is shown, along with the code showing the equivalent of the optimization.

Important notes:

  • There are many other kinds of optimizations, some of which operate at the object code level.
  • Many of these optimizations cannot be performed when code has a side effect – for example, the direction of a loop can sometimes be reversed without impact if the loop contains only calculations, but cannot be reversed if the loop contains I/O operations (such as printf()) or if a calculation depends on the result of a previous calculation (common in cryptography). Tracking these conditions can become complex!
  • To make decisions on whether to apply some specific optimizations, as well as to provide information (such as code for inlining) required to apply optimizations, the compiler requires visibility into all of the code in the program. This is often not available if some of the code exists in a library that was previously compiled, or is contained in another compilation unit (e.g., compiled into a .o file by a separate compilation step). To enable more extensive optimization, Link Time Optimization (LTO) can be used.
  • Many optimization decisions are based on a cost-benefit analysis. For example, optimizations such as loop unrolling will often increase performance at the cost of larger code size. In some cases, the usage pattern of the code, such as loop counts and procedure call frequency, is relevant to optimization decisions. Profile-Guided Optimization (PGO) can assist in making these determinations.

Code Rewriting Optimizations

These optimizations involve rewriting code for performance or space.

Although these examples are shown using rewritten source code, the compiler actually applies these optimizations to an intermediate representation of the program which does not closely resemble the original source.

Dead Code Elimination

Some code cannot be reached in order to be executed. For example:

 int y=(int) x * 2;
 if (x % 2 == 0) {
   printf("This code cannot be reached.\n");
 }

This “if” code will be completely eliminated by the compiler because it is “dead code” - it cannot be reached in normal program execution.

Another example of dead code may include stores performed to variables that are never read.

Dead code is intentionally used in some situations. For example, in code for embedded devices, there may be debug statements like this:

  #define DEBUG 1
  #define DEBUG_PRINTF if (DEBUG) serial.printf
  ...
  DEBUG_PRINTF("Threshold value at %d mS: %d\n", milli, threshold);
  ...
  DEBUG_PRINTF("Entering sleep mode.\n");

Each DEBUG_PRINTF statement in the code will output messages to the diagnostic serial port on the device. There may be hundreds of these statements in the code. Once the software has been thoroughly tested, the macro definition of DEBUG will be changed to 0, and the compiler will eliminate all of the DEBUG_PRINTF code when the program is compiled.

Strength Reduction

Certain operation are expensive in terms of the processor time they consume. Replacing these expensive operations with cheaper (simpler, faster) operations is called strength reduction.

This is a simple loop that counts to ten; each iteration displays the loop index multiplied by 6:

 int x;
 for (x=0; x < 10; x++) {
     printf("%d\n", x*6);
 }

By adding six to the loop index instead of one after each iteration, the slow multiplication can be eliminated from the loop; on most CPUs, the addition takes the same amount of time as the increment in the code above:

 int x;
 for (x=0; x < 60; x+=6) {
     printf("%d", x);
 }

Hoisting

Hoisting involves moving operations outside of a loop.

Hoisting I - Loop-Invariant Variable

Here is a simple loop:

 int t, x;
 double c;
 t = readtemp(); /* current temp in farenheit */
 for (x = 0; x < 200; x++) {
     c = (t-32)/1.8000 + 273.15;
     foo(x,c);
 }

The value of c is invariant within the loop, so the calculation of its value may be safely moved outside the loop:

 int t, x;
 double c;
 t = readtemp(); /* current temp in farenheit */
 c = (t-32)/1.8000 + 273.15;
 for (x = 0; x < 200; x++) {
     foo(x,c);
 }
Hoisting II - Loop-Invariant Expression

Here is a similar example:

 int t, x;
 t = readtemp(); /* current temp in farenheit */
 for (x = 0; x < 200; x++) {
     foo(x,(t-32)/1.8000 + 273.15);
 }

In this case, there is no variable with a loop-invariant value, but the second argument to the function foo() is invariant and can be hoisted out of the loop:

 int t, x;
 double c;
 t = readtemp(); /* current temp in farenheit */
 c = (t-32)/1.8000 + 273.15;
 for (x = 0; x < 200; x++) {
     foo(x,c);
 }
Hoisting III - Loop-Invariant Expression in Loop Condition

In this example, the invariant expression is in the loop control condition:

 int x, y;
 y = foo();
 for (x=0; x < y*12; x++) {
     bar(x);
 }

Hoisting this out of the loop will reduce the number of multiplications performed:

 int x, y, c;
 y = foo();
 c = y * 12;
 for (x=0; x < c; x++) {
     bar(x);
 }

Pre-calculation of Constants

Where possible, the compiler will evaluate constant expressions at compile time rather than runtime:

 ff = (212-32)/100;   /* factor for celcius-farenheit conversion */
 conv = c * ff + 32;

The factor ff will never change, so this code could be rewritten as:

 conv = c * 1.800 + 32;

Loop Unswitching

It is sometimes possible to swap a condition inside a loop for loops inside a condition.

Loop Unswitching I - Inner/Outer Swap

Consider this code:

 int foo(float ctl) {
     int x;
     for (x=0; x < 10000; x++) {
         if (ctl == 0) {
             bar(x);
         } else {
             qux(x);
         }
     }
 }

This code evaluates the condition ctrl == 0 ten thousand times, but that condition will never change within the loop. If the loop is rewritten as a pair of loops inside an outer if statement, the condition can be evaluated just once, at the expense of larger code:

 int foo(float ctl) {
     int x;
     if (ctl == 0) {
         for (x=0; x < 10000; x++) {
             bar(x);
         }
     } else {
         for (x=0; x < 10000; x++) {
             qux(x);
         }
     }
 }
Loop Unswitching II - Inner/Outer Swap with Code Repetition

This example is similar. Here, the loop contents are partially-conditional:

 int foo(float ctl) {
     int x;
     for (x=0; x < 10000; x++) {
         bar(x);
         if (ctl == 0) {
             qux(x);
         }
     }
 }

Constructing two loops and placing them in a conditional structure removes the condition evaluation from inside the loop – again, at the expense of size:

 int foo(float ctl) {
     int x;
     if (ctl == 0) {
         for (x=0; x < 10000; x++) {
             bar(x);
             qux(x);
         }
     } else {
         for (x=0; x < 10000; x++) {
             bar(x);
         }
     }
 }

Loop Splitting

Similar to loop unswitching, loop splitting will split a loop into two if two parts of the loop are handled differently:

 int foo(float ctl) {
     int x;
     for (x=0; x < 10000; x++) {
         if (x < 3700) {
             bar(x);
         } else {
             qux(x);
         }
     }
 }

Becomes:

 int foo(float ctl) {
     int x;
     for (x=0; x < 3700; x++) {
             bar(x);
     }
     for (; x < 10000; x++) {
             qux(x);
     }
 }

Loop Interchange

Swapping an inner and outer loop can sometimes improve performance. This code iterates through a two-dimensional array:

 int x, y, max=10000;
 long double a[[max]][max];
 for (y=0; y < max; y++) {
     for (x=0; x<max; x++) {
         a[[x]][y]=foo(x,y);
     }
 }

If the data is stored in memory in row-major format (all of row 0 is followed by all of row 1), then the memory accesses in the loop above are spread across memory. This reduces the effectiveness of pre-fetching and caching memory.

Swapping the inner and outer loops achieves the same result, but causes the memory access to be sequential:

 int x, y, max=10000;
 long double a[[max]][max];
 for (x=0; x < max; x++) {
     for (y=0; y<max; y++) {
         a[[x]][y]=foo(x,y);
     }
 }

Loop Unrolling

A loop can be unrolled so that the inner portion of the loop contains multiple copies of the code (corresponding to multiple iterations of the loop). This takes additional space, but reduces the number of times that the loop control condition is evaluated.

Loop Unrolling I - Guaranteed-Multiple Iterations

This loop is guaranteed to execute an even number of times:

 int x, y;
 y=foo();
 for (x = 0; x < y*2; x++) {
     bar(x);
 }

It can be unrolled so that the loop expression is evaluated only every second time that the loop contents are executed:

 int x, y;
 y=foo();
 for (x = 0; x < y*2; ) {
     bar(x++);
     bar(x++);
 }
Loop Unrolling II - Pairs-of-Iterations plus a Conditional Extra Iteration

The unrolling performed in the previous example can be performed even when the number of iterations is not guaranteed to be even, if you add an additional conditionally-executed iteration after the loop.

Consider this code:

 int x, y;
 y=foo();
 for (x=0; x < y; x++) {
     bar(x);
 }

If we unroll the loop so that it executes an even number of times, we can use an if statement to conditionally execute the loop contents one extra time if an odd number of iterations are required:

 int x, y;
 y=foo();
 for (x=0; x < y-1; ) {
     bar(x++);
     bar(x++);
 }
 if (x < y ) {
     bar(x++);
 }
Loop Unrolling III - Large Number of Iterations

You can extend this principle to a larger number of unrolled iterations:

 int x, y;
 y=foo();
 for (x=0; x < y; x++) {
     bar(x);
 }

This code evaluates the loop control condition only one-tenth as often as the original loop, up to the largest multiple of ten iterations that is less than the total iteration count:

 int x, y;
 y=foo();
 for (x=0; x < y-10; ) {
     bar(x++);
     bar(x++);
     bar(x++);
     bar(x++);
     bar(x++);
     bar(x++);
     bar(x++);
     bar(x++);
     bar(x++);
     bar(x++);
 }
 for (; x < y; ) {
     bar(x++);
 }

Loop unrolling is often beneficial as long as the unrolled loop fits into cache; unrolling past that point will reduce performance.

Inlining

Inlining takes code from a function and places it directly into the calling routine, eliminating the function call.

 int foo(int a, int b) {
     return a + b * 2;
 }
 int x, i;
 long ttl=0;
 for (x = 0; x < 10000; x++) {
     i = rand();
     ttl += foo(x, i);
 }

Using inlining, the calls to the function foo() can be eliminated:

 int x, i;
 long ttl=0;
 for (x = 0; x < 10000; x++) {
     i = rand();
     ttl += x + i*2;
 }

Exercise: Can you see an additional optimization (besides loop unrolling) that could be performed in the example above?

Common Subexpression Elimination

Evaluating and saving the result for a common subexpression can reduce execution time. Consider this expression:

 x = (a * c) - (a * c) / 12

The subexpression (a * c) is evaluated twice. It could be evaluated once and used twice, saving an expensive multiplication:

 t = (a * c)
 x = t - t/12

Jump Threading

Jump threading involves eliminating unnecessary condition evaluation by recognizing conditions that are in alternation.

 int a, b;
 a=foo();
 b=!a;
 if (a) {
     bar();
 }
 if (b) {
     baz();
 }

The variable b and the second condition are unnecessary and can be eliminated:

 int a, b;
 a=foo();
 if (a) {
     bar();
 } else {
     baz();
 }

Short-Circuit Evaluation

Short Circuit I - AND

When evaluating a condition with a logical AND (&&), both halves of the condition must be true for the entire condition to be true. Therefore, it is not necessary to evaluate the second half of the condition if the first half is false:

 if (a * b > c && d > e) {
     foo();
 }

You could rewrite this as:

 if (a * b > c) {
     if (d > e) {
         foo();
     }
 }
Short Circuit II - OR

In the case of a logical OR (||), the short-circuit permits you to skip evaluation of the second half of the condition if the first half is true.

 if (a * b > c || d > e ) {
     foo();
 }

Could be rewritten as:

 // This makes more sense in assembler than C!
 if (a * b > c) {
     foo();
 } elseif (d > e) {
     foo();
 }

Test Reordering

In a short-circuit expression, you can perform a partial strength reduction by placing the least expensive expression first:

 if (a * b > c && d > e) {
     foo();
 }

When rewriting this in short-circuit form, the expression a * b > c contains a multiplication, while the expression d > e does not – so the simpler expression should be checked first:

 if (d > e) {
     if (a * b > c) {
         foo();
     }
 }

Dead Store Elimination

A dead store occurs when a value is stored to a variable and never read again, often because the value is overwritten before being read.

Dead Store Elimination I - Simple Case

Here is a simple dead store example:

 a = b * c + d / f;
 a = c / f * 60;

This looks stupid, but is actually fairly common in programs, especially if there are many lines of code between the two assignments.

This can be written as:

 a = c / f * 60;
Dead Store Elimination II - Unused Initialized Value in Declaration

A common source of dead stores is unused initialized values from declarations:

 int a = 0;
 /* ...possibly many lines later, before a is read: */
 a = foo();

Eliminating the dead store saves a memory access:

 int a;
 a = foo();
Dead Store Elimination III - Special Case

Some dead stores are the result of overriding a variable value in a special case:

 int a;
 a = b * c + d / f;     /* normal/default case */
 if (foo()) {
     a = c / f * 60;    /* special case! */
 }

This can be written so that only one store is ever performed:

 int a;
 if (foo()) {
     a = c / f * 60;
 } else {
     a = b * c + d / f;
 }

Machine-Code Level Optimization

In addition to code rewriting, there are optimizations that can be performed at the machine language level – the actual instructions that are emitted by the compiler and the arrangement of these instructions in memory. Here are some examples:

Block Rearrangement

In a block of code such as this:

 if (condition) {
   action
  }
  following code

The compiler usually inverts the condition test and converts it into a conditional branch/jump – if the condition is FALSE, then the CPU branches to the following code.

There are many cases where the condition is likely to be false almost all of the time. For example, this test could be for an error condition, a very rare data case, or some kind of sanity check.

In a normal cache-load and prefetch pattern, the code for the conditional action is loaded into cache and the early stages of the processor's pipeline system and may even be speculatively executed while the condition is evaluated. If the condition is almost always false, its would be more efficient to load the following code instead of the action code, so the compiler will rearrange the code blocks to place the action code into another area of memory.

In order to do this, the compiler needs to know that the condition is almost always false. The programmer can state this using Compiler Intrinsics such as the GCC __builtin_expect function, or the compiler can discover this by code analysis or PGO.

(Note that in the Linux Kernel, the __builtin_expect function is used by the friendlier likely() and unlikely() macros; this article on Kernel Newbies explains how this works. Other codebases may use a similar approach).

Instruction Selection

There are many optimizations that occur at the object code level. Selecting an instruction that provides the same result as another instruction but which takes less time to execute, called instruction selection, is a simple and common object code optimization.

In order to perform instruction selection, the compiler relies upon a Cost Model, which contain detailed information about the relative execution cost of each instruction which the target can execute. Execution costs can vary between different implementations of an architecture (e.g., cpu models) due to varying numbers and efficiencies of execution units, pipeline size, cache, memory interface details, and so forth, so most compilers accepts both a target architecture and a specific chip model (for example, -march and -mtune in gcc).

The x86 instruction set in particular has some quirks that the chip designers have maintained over the years, including the fastest way to zero out a register.

Debugging with Optimizations Enabled

Using low-level debugging tools with code that has been highly optimized can be very challenging, because the object code may bear little resemblance to the source code. However, some errors may only surface when optimization is enabled - for example, optimized code may perform an operation before a device is ready to receive data. Debugging the unoptimized code may not reveal the problem, because the extra execution time may eliminate the error.

The gcc option -Og attempts to balance optimization with the debugging experience, by enabling only optimizations that will not excessively convolute the debugging process.

Resources

spo600/compiler_optimizations.txt · Last modified: 2024/10/08 17:52 by chris

Donate Powered by PHP Valid HTML5 Valid CSS Driven by DokuWiki