TMS320C28x Optimizing C/C++ Compiler v15.9.0.STS User's Guide
SPRU514 - REVISED SEPTEMBER, 2015

3 Optimizing Your Code

The compiler tools can perform many optimizations to improve the execution speed and reduce the size of C and C++ programs by simplifying loops, rearranging statements and expressions, and allocating variables into registers.

This chapter describes how to invoke different levels of optimization and describes which optimizations are performed at each level. This chapter also describes how you can use the Interlist feature when performing optimization and how you can profile or debug optimized code.

3.1 Invoking Optimization

The C/C++ compiler is able to perform various optimizations. High-level optimizations are performed in the optimizer and low-level, target-specific optimizations occur in the code generator. Use high-level optimization levels, such as --opt_level=2 and --opt_level=3, to achieve optimal code.

The easiest way to invoke optimization is to use the compiler program, specifying the --opt_level=n option on the compiler command line. You can use -On as an alias for the --opt_level=n option. The n denotes the level of optimization (0, 1, 2, 3, or 4), which controls the type and degree of optimization.
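For example, either of the following command lines compiles a hypothetical source file named func.c at optimization level 2 (cl2000 is the compiler shell program for C28x targets; the file name is an assumption for illustration):

cl2000 --opt_level=2 func.c
cl2000 -O2 func.c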

  • --opt_level=off or -Ooff
    • Performs no optimization
  • --opt_level=0 or -O0
    • Performs control-flow-graph simplification
    • Allocates variables to registers
    • Performs loop rotation
    • Eliminates unused code
    • Simplifies expressions and statements
    • Expands calls to functions declared inline
  • --opt_level=1 or -O1
  • Performs all --opt_level=0 (-O0) optimizations, plus:

    • Performs local copy/constant propagation
    • Removes unused assignments
    • Eliminates local common expressions
  • --opt_level=2 or -O2
  • Performs all --opt_level=1 (-O1) optimizations, plus:

    • Performs loop optimizations
    • Eliminates global common subexpressions
    • Eliminates global unused assignments
    • Performs loop unrolling
  • --opt_level=3 or -O3
  • Performs all --opt_level=2 (-O2) optimizations, plus:

    • Removes all functions that are never called
    • Simplifies functions with return values that are never used
    • Inlines calls to small functions
    • Reorders function declarations so that the called functions' attributes are known when the caller is optimized
    • Propagates arguments into function bodies when all calls pass the same value in the same argument position
    • Identifies file-level variable characteristics
    • If you use --opt_level=3 (-O3), see Section 3.2 and Section 3.3 for more information.

  • --opt_level=4 or -O4
  • Performs link-time optimization. See Section 3.4 for details.

By default, debugging is enabled and the default optimization level is unaffected by the generation of debug information. However, the optimization level used is affected by whether or not the command line includes the -g (--symdebug:dwarf) option and the --opt_level option as shown in the following table:

Table 3-1 Interaction Between Debugging and Optimization Options

Optimization              no -g                       -g
no --opt_level            --opt_level=off             --opt_level=off
--opt_level               --opt_level=2               --opt_level=2
--opt_level=n             optimized as specified      optimized as specified

The levels of optimizations described above are performed by the stand-alone optimization pass. The code generator performs several additional optimizations, particularly processor-specific optimizations. It does so regardless of whether you invoke the optimizer. These optimizations are always enabled, although they are more effective when the optimizer is used.

3.2 Performing File-Level Optimization (--opt_level=3 option)

The --opt_level=3 option (aliased as the -O3 option) instructs the compiler to perform file-level optimization. You can use the --opt_level=3 option alone to perform general file-level optimization, or you can combine it with other options to perform more specific optimizations. The options listed in Table 3-2 work with --opt_level=3 to perform the indicated optimization:

Table 3-2 Options That You Can Use With --opt_level=3

If You...                                               Use this Option                                       See
Have files that redeclare standard library functions    --std_lib_func_defined or --std_lib_func_redefined    Section 3.2.1
Want to create an optimization information file         --gen_opt_info=n                                      Section 3.2.2
Want to compile multiple source files                   --program_level_compile                               Section 3.3

3.2.1 Controlling File-Level Optimization (--std_lib_func_def Options)

When you invoke the compiler with --opt_level=3, some of the optimizations use known properties of the standard library functions. If your file redeclares any standard library functions, these optimizations become ineffective. Use Table 3-3 to select the appropriate file-level optimization option.

Table 3-3 Selecting a File-Level Optimization Option

If Your Source File...                                                    Use this Option
Declares a function with the same name as a standard library function     --std_lib_func_redefined
Contains but does not alter functions declared in the standard library    --std_lib_func_defined
Does not alter standard library functions, but you used the
--std_lib_func_redefined or --std_lib_func_defined option in a command
file or an environment variable (--std_lib_func_not_defined restores
the default behavior of the optimizer)                                    --std_lib_func_not_defined
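For example, the following sketch (the file name mymem.c and this memcpy body are illustrative assumptions) defines its own version of the standard memcpy, so it should be compiled with --std_lib_func_redefined to keep the optimizer from applying its built-in knowledge of the library function:

/* mymem.c: redefines a standard library function */
#include <stddef.h>

void *memcpy(void *dst, const void *src, size_t n)
{
    char *d = (char *)dst;
    const char *s = (const char *)src;

    while (n--)
        *d++ = *s++;      /* customized copy replaces the library version */
    return dst;
}

A command line such as cl2000 --opt_level=3 --std_lib_func_redefined mymem.c then prevents optimizations that assume the standard memcpy semantics.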

3.2.2 Creating an Optimization Information File (--gen_opt_info Option)

When you invoke the compiler with the --opt_level=3 option, you can use the --gen_opt_info option to create an optimization information file that you can read. The number following the option denotes the level (0, 1, or 2). The resulting file has an .nfo extension. Use Table 3-4 to select the appropriate level to append to the option.

Table 3-4 Selecting a Level for the --gen_opt_info Option

If You...                                                             Use this Option
Do not want to produce an information file, but you used the
--gen_opt_info=1 or --gen_opt_info=2 option in a command file or an
environment variable (--gen_opt_info=0 restores the default
behavior of the optimizer)                                            --gen_opt_info=0
Want to produce an optimization information file                      --gen_opt_info=1
Want to produce a verbose optimization information file               --gen_opt_info=2

3.3 Program-Level Optimization (--program_level_compile and --opt_level=3 options)

You can specify program-level optimization by using the --program_level_compile option with the --opt_level=3 option (aliased as -O3). (If you use --opt_level=4 (-O4), the --program_level_compile option cannot be used, because link-time optimization provides the same optimization opportunities as program level optimization.)

With program-level optimization, all of your source files are compiled into one intermediate file called a module. The module moves to the optimization and code generation passes of the compiler. Because the compiler can see the entire program, it performs several optimizations that are rarely applied during file-level optimization:

  • If a particular argument in a function always has the same value, the compiler replaces the argument with the value and passes the value instead of the argument.
  • If a return value of a function is never used, the compiler deletes the return code in the function.
  • If a function is not called directly or indirectly by main(), the compiler removes the function.
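As a minimal sketch (hypothetical code), the following program exhibits all three opportunities: scale() always receives 4 as its first argument, its return value is never used, and unused() is never called from main():

int total = 0;

int scale(int gain, int x)       /* every call passes gain == 4, so the
                                    compiler can propagate the constant */
{
    total += gain * x;
    return total;                /* return value ignored by all callers:
                                    the return code can be deleted      */
}

int unused(void)                 /* not called directly or indirectly
                                    from main(): removed entirely       */
{
    return 42;
}

int main(void)
{
    scale(4, 10);
    scale(4, 20);
    return total;
}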

The --program_level_compile option requires use of --opt_level=3 or higher in order to perform these optimizations.

To see which program-level optimizations the compiler is applying, use the --gen_opt_info=2 option to generate an information file. See Section 3.2.2 for more information.

In Code Composer Studio, when the --program_level_compile option is used, C and C++ files that have the same options are compiled together. However, if any file has a file-specific option that is not selected as a project-wide option, that file is compiled separately. For example, if every C and C++ file in your project has a different set of file-specific options, each is compiled separately, even though program-level optimization has been specified. To compile all C and C++ files together, make sure the files do not have file-specific options. Be aware that compiling C and C++ files together may not be safe if previously you used a file-specific option.

NOTE

Compiling Files With the --program_level_compile and --keep_asm Options

If you compile all files with the --program_level_compile and --keep_asm options, the compiler produces only one .asm file, not one for each corresponding source file.

3.3.1 Controlling Program-Level Optimization (--call_assumptions Option)

You can control program-level optimization, which you invoke with --program_level_compile --opt_level=3, by using the --call_assumptions option. Specifically, the --call_assumptions option indicates if functions in other modules can call a module's external functions or modify a module's external variables. The number following --call_assumptions indicates the level you set for the module that you are allowing to be called or modified. The --opt_level=3 option combines this information with its own file-level analysis to decide whether to treat this module's external function and variable declarations as if they had been declared static. Use Table 3-5 to select the appropriate level to append to the --call_assumptions option.

Table 3-5 Selecting a Level for the --call_assumptions Option

If Your Module … Use this Option
Has functions that are called from other modules and global variables that are modified in other modules --call_assumptions=0
Does not have functions that are called by other modules but has global variables that are modified in other modules --call_assumptions=1
Does not have functions that are called by other modules or global variables that are modified in other modules --call_assumptions=2
Has functions that are called from other modules but does not have global variables that are modified in other modules --call_assumptions=3

In certain circumstances, the compiler reverts to a different --call_assumptions level from the one you specified, or it might disable program-level optimization altogether. Table 3-6 lists the combinations of --call_assumptions levels and conditions that cause the compiler to revert to other --call_assumptions levels.

Table 3-6 Special Considerations When Using the --call_assumptions Option

If --call_assumptions is...                     Under these Conditions...                                                                                     Then the --call_assumptions Level...
Not specified                                   The --opt_level=3 optimization level was specified                                                            Defaults to --call_assumptions=2
Not specified                                   The compiler sees calls to outside functions under the --opt_level=3 optimization level                      Reverts to --call_assumptions=0
Not specified                                   main() is not defined                                                                                         Reverts to --call_assumptions=0
--call_assumptions=1 or --call_assumptions=2    main() is not defined as an entry point, no interrupt functions are defined, and no functions are identified by the FUNC_EXT_CALLED pragma    Reverts to --call_assumptions=0
--call_assumptions=1 or --call_assumptions=2    A main() function is defined, or an interrupt function is defined, or a function is identified by the FUNC_EXT_CALLED pragma                  Remains --call_assumptions=1 or --call_assumptions=2
--call_assumptions=3                            Any condition                                                                                                 Remains --call_assumptions=3

In some situations when you use --program_level_compile and --opt_level=3, you must use a --call_assumptions option or the FUNC_EXT_CALLED pragma. See Section 3.3.2 for information about these situations.

3.3.2 Optimization Considerations When Mixing C/C++ and Assembly

If you have any assembly functions in your program, you need to exercise caution when using the --program_level_compile option. The compiler recognizes only the C/C++ source code, not any assembly code that might be present. Because the compiler does not see calls made from assembly code or modifications that assembly code makes to C/C++ variables, the --program_level_compile option can optimize out those C/C++ functions. To keep these functions, place the FUNC_EXT_CALLED pragma (see Section 6.9.11) before any declaration or reference to a function that you want to keep.
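For example, if an assembly-language interrupt routine calls a C function named c_handler() (a hypothetical name), a sketch like the following keeps that function alive under program-level optimization; the pragma precedes the declaration:

/* Keep c_handler even though no C/C++ code calls it; it is
   called only from assembly code.                          */
#pragma FUNC_EXT_CALLED(c_handler)
void c_handler(void);

volatile unsigned int status;    /* written by the interrupt path,
                                    so declared volatile            */

void c_handler(void)
{
    status++;
}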

Another approach you can take when you use assembly functions in your program is to use the --call_assumptions=n option with the --program_level_compile and --opt_level=3 options. See Section 3.3.1 for information about the --call_assumptions=n option.

In general, you achieve the best results through judicious use of the FUNC_EXT_CALLED pragma in combination with --program_level_compile --opt_level=3 and --call_assumptions=1 or --call_assumptions=2.

If any of the following situations apply to your application, use the suggested solution:

    Situation

    Your application consists of C/C++ source code that calls assembly functions. Those assembly functions do not call any C/C++ functions or modify any C/C++ variables.

    Solution

    Compile with --program_level_compile --opt_level=3 --call_assumptions=2 to tell the compiler that outside functions do not call C/C++ functions or modify C/C++ variables.

    If you compile with the --program_level_compile --opt_level=3 options only, the compiler reverts from the default optimization level (--call_assumptions=2) to --call_assumptions=0. The compiler uses --call_assumptions=0 because it presumes that the assembly functions, whose definitions it cannot see, may call other C/C++ functions or modify C/C++ variables.

    Situation

    Your application consists of C/C++ source code that calls assembly functions. The assembly language functions do not call C/C++ functions, but they modify C/C++ variables.

    Solution

    Try both of these solutions and choose the one that works best with your code:

    • Compile with --program_level_compile --opt_level=3 --call_assumptions=1.
    • Add the volatile keyword to those variables that may be modified by the assembly functions and compile with --program_level_compile --opt_level=3 --call_assumptions=2.
    Situation

    Your application consists of C/C++ source code and assembly source code. The assembly functions are interrupt service routines that call C/C++ functions; the C/C++ functions that the assembly functions call are never called from C/C++. These C/C++ functions act like main: they function as entry points into C/C++.

    Solution

    Add the volatile keyword to the C/C++ variables that may be modified by the interrupts. Then, you can optimize your code in one of these ways:

    • You achieve the best optimization by applying the FUNC_EXT_CALLED pragma to all of the entry-point functions called from the assembly language interrupts, and then compiling with --program_level_compile --opt_level=3 --call_assumptions=2. Be sure that you use the pragma with all of the entry-point functions. If you do not, the compiler might remove the entry-point functions that are not preceded by the FUNC_EXT_CALLED pragma.
    • Compile with --program_level_compile --opt_level=3 --call_assumptions=3. Because you do not use the FUNC_EXT_CALLED pragma, you must use the --call_assumptions=3 option, which is less aggressive than the --call_assumptions=2 option, and your optimization may not be as effective.

    Keep in mind that if you use --program_level_compile --opt_level=3 without additional options, the compiler removes the C functions that the assembly functions call. Use the FUNC_EXT_CALLED pragma to keep these functions.

3.4 Link-Time Optimization (--opt_level=4 Option)

Link-time optimization is an optimization mode that gives the compiler visibility of the entire program. The optimization occurs at link time instead of at compile time, as with the other optimization levels.

Link-time optimization is invoked by using the --opt_level=4 option. This option must be used in both the compilation and linking steps. At compile time, the compiler embeds an intermediate representation of the file being compiled into the resulting object file. At link time, this representation is extracted from every object file that contains it and is used to optimize the entire program.

If you use --opt_level=4 (-O4), the --program_level_compile option cannot also be used, because link-time optimization provides the same optimization opportunities as program level optimization (Section 3.3). Link-time optimization provides the following benefits:

  • Each source file can be compiled separately. One issue with program-level compilation is that it requires all source files to be passed to the compiler at one time. This often requires significant modification of a customer's build process. With link-time optimization, all files can be compiled separately.
  • References to C/C++ symbols from assembly are handled automatically. When doing program-level compilation, the compiler has no knowledge of whether a symbol is referenced externally. When performing link-time optimization during a final link, the linker can determine which symbols are referenced externally and prevent eliminating them during optimization.
  • Third party object files can participate in optimization. If a third party vendor provides object files that were compiled with the --opt_level=4 option, those files participate in optimization along with user-generated files. This includes object files supplied as part of the TI run-time support. Object files that were not compiled with --opt_level=4 can still be used in a link that is performing link-time optimization. Those files that were not compiled with --opt_level=4 do not participate in the optimization.
  • Source files can be compiled with different option sets. With program-level compilation, all source files must be compiled with the same option set. With link-time optimization, files can be compiled with different options. If the compiler determines that two options are incompatible, it issues an error.
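For example, a hypothetical two-file project (f1.c and f2.c are illustrative names) might be built as follows; note that --opt_level=4 appears in both the compile and link steps, and that other linker inputs, such as a linker command file, are omitted here for brevity:

cl2000 --opt_level=4 --compile_only f1.c
cl2000 --opt_level=4 --compile_only f2.c
cl2000 --opt_level=4 --run_linker f1.obj f2.obj --output_file=app.out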

3.4.1 Option Handling

When performing link-time optimization, source files can be compiled with different options. When possible, the options that were used during compilation are used during link-time optimization. For options which apply at the program level, --auto_inline for instance, the options used to compile the main function are used. If main is not included in link-time optimization, the option set used for the first object file specified on the command line is used. Some options, --opt_for_speed for instance, can affect a wide range of optimizations. For these options, the program-level behavior is derived from main, and the local optimizations are obtained from the original option set.

Some options are incompatible when performing link-time optimization. These are usually options which conflict on the command line as well, but can also be options that cannot be handled during link-time optimization.

3.4.2 Incompatible Types

During a normal link, the linker does not check to make sure that each symbol was declared with the same type in different files. This is not necessary during a normal link. When performing link-time optimization, however, the linker must ensure that all symbols are declared with compatible types in different source files. If a symbol is found which has incompatible types, an error is issued. The rules for compatible types are derived from the C and C++ standards.

3.5 Special Considerations When Using Optimization

The compiler is designed to improve your ANSI/ISO-conforming C and C++ programs while maintaining their correctness. However, when you write code for optimization, you should note the special considerations discussed in the following sections to ensure that your program performs as you intend.

3.5.1 Use Caution With asm Statements in Optimized Code

You must be extremely careful when using asm (inline assembly) statements in optimized code. The compiler rearranges code segments, uses registers freely, and can completely remove variables or expressions. Although the compiler never optimizes out an asm statement (except when it is unreachable), the surrounding environment where the assembly code is inserted can differ significantly from the original C/C++ source code.

It is usually safe to use asm statements to manipulate hardware controls such as interrupt masks, but asm statements that attempt to interface with the C/C++ environment or access C/C++ variables can have unexpected results. After compilation, check the assembly output to make sure your asm statements are correct and maintain the integrity of the program.
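For example, using asm statements for raw interrupt control is typically safe, because it does not depend on how the compiler has allocated registers or variables (a sketch; DINT and EINT are the C28x instructions that disable and enable interrupts):

void critical_update(volatile unsigned int *reg, unsigned int val)
{
    asm(" DINT");    /* disable interrupts: pure hardware control */
    *reg = val;      /* the C environment handles the data access */
    asm(" EINT");    /* re-enable interrupts                      */
}

The string passed to asm is copied verbatim into the compiler's assembly output, so the leading space keeps the mnemonic out of the label field.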

3.5.2 Use the Volatile Keyword for Necessary Memory Accesses

The compiler analyzes data flow to avoid memory accesses whenever possible. If you have code that depends on memory accesses exactly as written in the C/C++ code, you must use the volatile keyword to identify these accesses. The compiler does not optimize out any references to volatile variables.

In the following example, the loop waits for a location to be read as 0xFF:

unsigned int *ctrl;
while (*ctrl != 0xFF);

In this example, *ctrl is a loop-invariant expression, so the loop is optimized down to a single memory read. To correct this, declare ctrl as:

volatile unsigned int *ctrl;

3.5.2.1 Use Caution When Accessing Aliased Variables

Aliasing occurs when a single object can be accessed in more than one way, such as when two pointers point to the same object or when a pointer points to a named object. Aliasing can disrupt optimization because any indirect reference can refer to another object. The compiler analyzes the code to determine where aliasing can and cannot occur, then optimizes as much as possible while still preserving the correctness of the program. The compiler behaves conservatively.

The compiler assumes that if the address of a local variable is passed to a function, the function might change the local by writing through the pointer, but that it will not make its address available for use elsewhere after returning. For example, the called function cannot assign the local's address to a global variable or return it. In cases where this assumption is invalid, use the --aliased_variables (aliased as -ma) compiler option to force the compiler to assume worst-case aliasing. In worst-case aliasing, any indirect reference (that is, any reference using a pointer) can refer to such a variable.

3.5.2.2 Use the --aliased_variables Option to Indicate That the Following Technique Is Used

The compiler, when invoked with optimization, assumes that any variable whose address is passed as an argument to a function will not be subsequently modified by an alias set up in the called function. Examples include:

  • Returning the address from a function
  • Assigning the address to a global

If you use aliases like this in your code, you must use the --aliased_variables option when you are optimizing your code. For example, if your code is similar to this, use the --aliased_variables option:

int *glob_ptr;

int *f(int *arg);    /* forward declarations added so the example compiles */
void h(int x);

void g()
{
    int x = 1;
    int *p = f(&x);

    *p = 5;           /* p aliases x */
    *glob_ptr = 10;   /* glob_ptr aliases x */

    h(x);
}

int *f(int *arg)
{
    glob_ptr = arg;
    return arg;
}
3.5.2.3 On FPU Targets Only: Use restrict Keyword to Indicate That Pointers Are Not Aliased

On FPU targets, with --opt_level=2, the optimizer performs dependency analysis. To help the compiler determine memory dependencies, you can qualify a pointer, reference, or array with the restrict keyword. The restrict keyword is a type qualifier that can be applied to pointers, references, and arrays. Its use represents a guarantee by the programmer that within the scope of the pointer declaration the object pointed to can be accessed only by that pointer. Any violation of this guarantee renders the program undefined. This practice helps the compiler optimize certain sections of code because aliasing information can be more easily determined. This can improve performance and code size, as more FPU operations can be parallelized.

As shown in Example 3-1 and Example 3-2 you can use the restrict keyword to tell the compiler that a and b never point to the same object in foo. Furthermore, the compiler is assured that the objects pointed to by a and b do not overlap in memory.

Example 3-1 Use of the restrict Type Qualifier With Pointers

void foo(float * restrict a, float * restrict b) { /* foo's code here */ }

Example 3-2 Use of the restrict Type Qualifier With Arrays

void foo(float c[restrict], float d[restrict]) { /* foo's code here */ }

3.6 Automatic Inline Expansion (--auto_inline Option)

When optimizing with the --opt_level=3 option (aliased as -O3), the compiler automatically inlines small functions. A command-line option, --auto_inline=size, specifies the size threshold. Any function larger than the size threshold is not automatically inlined. You can use the --auto_inline=size option in the following ways:

  • If you set the size parameter to 0 (--auto_inline=0), automatic inline expansion is disabled.
  • If you set the size parameter to a nonzero integer, the compiler uses this size threshold as a limit to the size of the functions it automatically inlines. The compiler multiplies the number of times the function is inlined (plus 1 if the function is externally visible and its declaration cannot be safely removed) by the size of the function.

The compiler inlines the function only if the result is less than the size parameter. The compiler measures the size of a function in arbitrary units; however, the optimizer information file (created with the --gen_opt_info=1 or --gen_opt_info=2 option) reports the size of each function in the same units that the --auto_inline option uses.
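For example, under this rule a function measuring 10 units that is inlined at three call sites and remains externally visible is weighted as (3 + 1) × 10 = 40 units, so it is inlined only if 40 is less than the size parameter. (The numbers are illustrative, not compiler output.)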

The --auto_inline=size option controls only the inlining of functions that are not explicitly declared as inline. If you do not use the --auto_inline=size option, the compiler inlines very small functions.

NOTE

Optimization Level 3 and Inlining

In order to turn on automatic inlining, you must use the --opt_level=3 option. If you desire the --opt_level=3 optimizations, but not automatic inlining, use --auto_inline=0 with the --opt_level=3 option.

NOTE

Inlining and Code Size

Expanding functions inline increases code size, especially inlining a function that is called in a number of places. Function inlining is optimal for functions that are called only from a small number of places and for small functions. To prevent increases in code size because of inlining, use the --auto_inline=0 and --no_inlining options. These options, used together, cause the compiler to inline intrinsics only.

3.7 Using the Interlist Feature With Optimization

You control the output of the interlist feature when compiling with optimization (the --opt_level=n or -On option) with the --optimizer_interlist and --c_src_interlist options.

  • The --optimizer_interlist option interlists compiler comments with assembly source statements.
  • The --c_src_interlist and --optimizer_interlist options together interlist the compiler comments and the original C/C++ source with the assembly code.

When you use the --optimizer_interlist option with optimization, the interlist feature does not run as a separate pass. Instead, the compiler inserts comments into the code, indicating how the compiler has rearranged and optimized the code. These comments appear in the assembly language file as comments starting with ;**. The C/C++ source code is not interlisted, unless you use the --c_src_interlist option also.

The interlist feature can affect optimized code because it might prevent some optimization from crossing C/C++ statement boundaries. Optimization makes normal source interlisting impractical, because the compiler extensively rearranges your program. Therefore, when you use the --optimizer_interlist option, the compiler writes reconstructed C/C++ statements.

Example 3-4 shows a function that has been compiled with optimization (--opt_level=2) and the --optimizer_interlist option. The assembly file contains compiler comments interlisted with assembly code.

NOTE

Impact on Performance and Code Size

The --c_src_interlist option can have a negative effect on performance and code size.

When you use the --c_src_interlist and --optimizer_interlist options with optimization, the compiler inserts its comments and the interlist feature runs before the assembler, merging the original C/C++ source into the assembly file.

Example 3-5 shows the function from Example 3-3 compiled with optimization (--opt_level=2) and the --c_src_interlist and --optimizer_interlist options. The assembly file contains compiler comments and C source interlisted with assembly code.

Example 3-3 C Code for Interlist Illustration

int copy (char *str, const char *s, int n)
{
    int i;

    for (i = 0; i < n; i++)
        *str++ = *s++;
}

Example 3-4 The Function From Example 3-3 Compiled With the -O2 and --optimizer_interlist Options

;***************************************************************
;* FNAME: _copy                         FR SIZE:   0            *
;*                                                              *
;* FUNCTION ENVIRONMENT                                         *
;*                                                              *
;* FUNCTION PROPERTIES                                          *
;*                            0 Parameter,  0 Auto,  0 SOE      *
;***************************************************************
_copy:
;*** 6  ----------------------- if ( n <= 0 ) goto g4;
        CMPB      AL,#0                ; |6|
        B         L2,LEQ               ; |6|
        ; branch occurs                ; |6|
;***    ----------------------- #pragma MUST_ITERATE(1, 4294967295, 1)
;***    ----------------------- L$1 = n-1;
        ADDB      AL,#-1
        MOVZ      AR6,AL
L1:
;***    -----------------------g3:
;*** 7  ----------------------- *str++ = *s++;
;*** 7  ----------------------- if ( (--L$1) != (-1) ) goto g3;
        MOV       AL,*XAR5++           ; |7|
        MOV       *XAR4++,AL           ; |7|
        BANZ      L1,AR6--
        ; branch occurs                ; |7|
;***    -----------------------g4:
;***    ----------------------- return;
L2:
        LRETR
        ; return occurs

Example 3-5 The Function From Example 3-3 Compiled with the --opt_level=2, --optimizer_interlist, and --c_src_interlist Options

;----------------------------------------------------------------------
;   2 | int copy (char *str, const char *s, int n)
;----------------------------------------------------------------------
;***************************************************************
;* FNAME: _copy                         FR SIZE:   0            *
;*                                                              *
;* FUNCTION ENVIRONMENT                                         *
;*                                                              *
;* FUNCTION PROPERTIES                                          *
;*                            0 Parameter,  0 Auto,  0 SOE      *
;***************************************************************
_copy:
;* AR4 assigned to _str
;* AR5 assigned to _s
;* AL  assigned to _n
;* AL  assigned to _n
;* AR5 assigned to _s
;* AR4 assigned to _str
;* AR6 assigned to L$1
;*** 6  ----------------------- if ( n <= 0 ) goto g4;
;----------------------------------------------------------------------
;   4 | int i;
;----------------------------------------------------------------------
;----------------------------------------------------------------------
;   6 | for (i = 0; i < n; i++)
;----------------------------------------------------------------------
        CMPB      AL,#0                ; |6|
        B         L2,LEQ               ; |6|
        ; branch occurs                ; |6|
;***    ----------------------- #pragma MUST_ITERATE(1, 4294967295, 1)
;***    ----------------------- L$1 = n-1;
        ADDB      AL,#-1
        MOVZ      AR6,AL
        NOP
L1:
;*** 7  ----------------------- *str++ = *s++;
;*** 7  ----------------------- if ( (--L$1) != (-1) ) goto g3;
;----------------------------------------------------------------------
;   7 | *str++ = *s++;
;----------------------------------------------------------------------
        MOV       AL,*XAR5++           ; |7|
        MOV       *XAR4++,AL           ; |7|
        BANZ      L1,AR6--
        ; branch occurs                ; |7|
;***    -----------------------g4:
;***    ----------------------- return;
L2:
        LRETR
        ; return occurs

3.8 Debugging and Profiling Optimized Code

Generating symbolic debugging information no longer affects the ability to optimize code. The same executable code is generated regardless of whether generation of debug information is turned on or off. For this reason, debug information is now generated by default. You do not need to specify the -g option in order to debug your application.

If you do not specify the -g option and allow the default generation of debug information to be used, the default level of optimization is used unless you specify some other optimization level.

The --symdebug:dwarf option no longer disables optimization, because generation of debug information no longer impacts optimization.

If you specify the -g option explicitly but do not specify an optimization level, no optimization is performed. This is because while generating debug information does not affect the ability to optimize code, optimizing code does make it more difficult to debug code. At higher levels of optimization, the compiler's extensive rearrangement of code and the many-to-many allocation of variables to registers often make it difficult to correlate source code with object code for debugging purposes. It is recommended that you perform debugging using the lowest level of optimization possible.

If you specify an --opt_level (aliased as -O) option, that optimization level is used no matter what type of debugging information you enabled.

The optimization level used if you do not specify the level on the command line is affected by whether or not the command line includes the -g option and the --opt_level option as shown in the following table:

Table 3-7 Interaction Between Debugging and Optimization Options

Optimization              no -g                       -g
no --opt_level            --opt_level=off             --opt_level=off
--opt_level               --opt_level=2               --opt_level=2
--opt_level=n             optimized as specified      optimized as specified

Debug information increases the size of object files, but it does not affect the size of code or data on the target. If object file size is a concern and debugging is not needed, use --symdebug:none to disable the generation of debug information.

The --optimize_with_debug and --symdebug:skeletal options have been deprecated and no longer have any effect.

3.8.1 Profiling Optimized Code

To profile optimized code, use optimization (--opt_level=0 through --opt_level=3).

If you have a power profiler, use the --profile:power option with the --opt_level option. The --profile:power option produces instrumented code for the power profiler.

3.9 Controlling Code Size Versus Speed

The latest mechanism for controlling the goal of optimizations in the compiler is represented by the --opt_for_speed=num option. The num denotes the level of optimization (0-5), which controls the type and degree of code size or code speed optimization:

  • --opt_for_speed=0
  • Enables optimizations geared towards improving code size, with a high risk of degrading performance.

  • --opt_for_speed=1
  • Enables optimizations geared towards improving code size, with a medium risk of degrading performance.

  • --opt_for_speed=2
  • Enables optimizations geared towards improving code size, with a low risk of degrading performance.

  • --opt_for_speed=3
  • Enables optimizations geared towards improving code performance/speed, with a low risk of increasing code size.

  • --opt_for_speed=4
  • Enables optimizations geared towards improving code performance/speed, with a medium risk of increasing code size.

  • --opt_for_speed=5
  • Enables optimizations geared towards improving code performance/speed, with a high risk of increasing code size.

If you specify the --opt_for_speed option without a parameter, the default setting is --opt_for_speed=4. If you do not specify the --opt_for_speed option, the default setting is --opt_for_speed=2.

The initial mechanism for controlling code space, the --opt_for_space option, has the following equivalences with the --opt_for_speed option:

--opt_for_space --opt_for_speed
none =4
=0 =3
=1 =2
=2 =1
=3 =0

A fast branch (BF) instruction is generated by default when the --opt_for_speed option is used. When --opt_for_speed is not used, the compiler generates a BF instruction only when the condition code is one of NEQ, EQ, NTC, or TC, because BF with these condition codes can be optimized to SBF. There is a code size penalty for using a BF instruction when the condition code is not one of NEQ, EQ, NTC, or TC. (Fast branch instructions are also generated for functions with the ramfunc function attribute.)

The --no_fast_branch option is deprecated and has no effect.

3.10 Increasing Code-Size Optimizations (--opt_for_space Option)

The --opt_for_space option increases the level of code-size optimizations performed by the compiler. These optimizations are done at the expense of performance. The optimizations include procedural abstraction where common blocks of code are replaced with function calls. For example, prolog and epilog code, certain intrinsics, and other common code sequences, can be replaced with calls to functions that are defined in the run-time library. It is necessary to link with the supplied run-time library when using the --opt_for_space option. It is not necessary to use optimization to invoke the --opt_for_space option.

To illustrate how the --opt_for_space option works, the following describes how prolog and epilog code can be replaced. This code is changed to function calls depending on the number of SOE registers, the size of the frame, and whether a frame pointer is used. These functions are declared in each file compiled with the --opt_for_space option, as shown below:

_prolog_c28x_1
_prolog_c28x_2
_prolog_c28x_3
_epilog_c28x_1
_epilog_c28x_2

Example 3-6 provides an example of C code to be compiled with the --opt_for_space option. The resulting output is shown in Example 3-7.

Example 3-6 C Code to Show Code-Size Optimizations

extern int x, y, *ptr;
extern int foo();

int main(int a, int b, int c)
{
    ptr[50] = foo();
    y = ptr[50] + x + y + a + b + c;
}

Example 3-7 Example 3-6 Compiled With the --opt_for_space Option

FP      .set    XAR2
        .global _prolog_c28x_1
        .global _prolog_c28x_2
        .global _prolog_c28x_3
        .global _epilog_c28x_1
        .global _epilog_c28x_2
        .sect   ".text"
        .global _main
;***************************************************************
;* FNAME: _main                         FR SIZE:   6            *
;*                                                              *
;* FUNCTION ENVIRONMENT                                         *
;*                                                              *
;* FUNCTION PROPERTIES                                          *
;*                            0 Parameter,  0 Auto,  6 SOE      *
;***************************************************************
_main:
        FFC       XAR7,_prolog_c28x_1
        MOVZ      AR3,AR4              ; |5|
        MOVZ      AR2,AH               ; |5|
        MOVZ      AR1,AL               ; |5|
        LCR       #_foo                ; |6|
        ; call occurs [#_foo]          ; |6|
        MOVW      DP,#_ptr
        MOVL      XAR6,@_ptr           ; |6|
        MOVB      XAR0,#50             ; |6|
        MOVW      DP,#_y
        MOV       *+XAR6[AR0],AL       ; |6|
        MOV       AH,@_y               ; |7|
        MOVW      DP,#_x
        ADD       AH,AL                ; |7|
        ADD       AH,@_x               ; |7|
        ADD       AH,AR3               ; |7|
        ADD       AH,AR1               ; |7|
        ADD       AH,AR2               ; |7|
        MOVB      AL,#0
        MOVW      DP,#_y
        MOV       @_y,AH               ; |7|
        FFC       XAR7,_epilog_c28x_1
        LRETR
        ; return occurs

3.11 Compiler Support for Generating DMAC Instructions

The C28x compiler supports DMAC instructions, which perform multiply-accumulate operations on two adjacent signed integers at the same time, optionally shifting the products. A multiply-accumulate operation multiplies two numbers and adds the product to an accumulator; that is, it computes a = a + (b × c). There are three levels of DMAC support that require different levels of C-source modification:

  • Generate DMAC instructions automatically from C code (see Section 3.11.1).
  • Use assertions for data address alignment to enable DMAC instruction generation (see Section 3.11.2).
  • Use the __dmac intrinsic (see Section 3.11.3).

3.11.1 Automatic Generation of DMAC Instructions

The compiler automatically generates DMAC instructions if the compiler recognizes the C-language statements as a DMAC opportunity and the compiler can verify that the data addresses being operated upon are 32-bit aligned. This is the best scenario, because it requires no code modification aside from data alignment pragmas. The following is an example:

int src1[N], src2[N];
#pragma DATA_ALIGN(src1,2);   // int arrays must be 32-bit aligned
#pragma DATA_ALIGN(src2,2);
{...}
int i;
long res = 0;

for (i = 0; i < N; i++)                // N must be a known even constant
    res += (long)src1[i] * src2[i];    // Arrays must be accessed via array indices

At optimization levels >= -O2, the compiler generates a RPT || DMAC instruction for the example code above if N is a known even constant.

DMAC instructions can also shift the product left by 1 or right by 1 to 6 before accumulation. For example:

for (i = 0; i < N; i++)
    res += (long)src1[i] * src2[i] >> 1;   // product shifted right by 1

3.11.2 Assertions to Specify Data Address Alignment

In some cases the compiler may recognize a DMAC opportunity in C-language statements but not be able to verify that the data addresses passed to the computation are both 32-bit aligned. In this case, assertions placed in the code can enable the compiler to generate DMAC instructions. The following is an example:

int *src1, *src2;   // src1 and src2 are pointers to int arrays of at least size N
                    // You must ensure that both are 32-bit aligned addresses
{...}
int i;
long res = 0;

_nassert((long)src1 % 2 == 0);
_nassert((long)src2 % 2 == 0);
for (i = 0; i < N; i++)                // N must be a known even constant
    res += (long)src1[i] * src2[i];    // src1 and src2 must be accessed via array indices

At optimization levels >= -O2, the compiler generates a RPT || DMAC instruction for the example code above if N is a known even constant.

The _nassert intrinsic generates no code, so it is not a compiler intrinsic like those listed in Table 7-6. Instead, it tells the optimizer that the asserted expression is true. This can be used to give the optimizer hints about which optimizations might be valid. In this example, _nassert asserts that the data addresses held in the src1 and src2 pointers are 32-bit aligned. You are responsible for ensuring that only 32-bit aligned data addresses are passed via these pointers; the code fails at run time if the asserted conditions are not true.

DMAC instructions can also shift the product left by 1 or right by 1 to 6 before accumulation. For example:

for (i = 0; i < N; i++)
    res += (long)src1[i] * src2[i] >> 1;   // product shifted right by 1

3.11.3 __dmac Intrinsic

You can force the compiler to generate a DMAC instruction by using the __dmac intrinsic. In this case, you are fully responsible for ensuring that the data addresses are 32-bit aligned.

void __dmac(long src1, long src2, long &accum1, long &accum2, int shift);
  • src1 and src2 must be 32-bit aligned addresses that point into int arrays.
  • accum1 and accum2 are pass-by-reference longs for the two temporary accumulations. These must be added together after the intrinsic to compute the total sum.
  • shift specifies how far to shift each product prior to accumulation. Valid shift values are -1 for a left shift by 1, 0 for no shift, and 1 through 6 for a right shift by 1 through 6, respectively. Note that this argument is required, so you must pass 0 if you want no shift.

See Table 7-6 for details about the __dmac intrinsic.

Example 1:

int src1[2*N], src2[2*N];   // src1 and src2 are int arrays of at least size 2N
                            // You must ensure that both start on 32-bit
                            // aligned boundaries.
{...}
int i;
long res = 0;
long temp = 0;

for (i = 0; i < N; i++)     // N does not have to be a known constant
    __dmac(((long *)src1)[i], ((long *)src2)[i], res, temp, 0);

res += temp;

Example 2:

int *src1, *src2;           // src1 and src2 are pointers to int arrays of at
                            // least size 2N. You must ensure that both are
                            // 32-bit aligned addresses.
{...}
int i;
long res = 0;
long temp = 0;
long *ls1 = (long *)src1;
long *ls2 = (long *)src2;

for (i = 0; i < N; i++)     // N does not have to be a known constant
    __dmac(*ls1++, *ls2++, res, temp, 0);

res += temp;

In these examples, res holds the final sum of a multiply-accumulate operation on int arrays of length 2*N, with two computations being performed at a time.

Additionally, an optimization level >= -O2 must be used to generate efficient code. Moreover, if there is nothing else in the loop body as in these examples, the compiler generates a RPT || DMAC instruction, further improving performance.

3.12 What Kind of Optimization Is Being Performed?

The TMS320C28x C/C++ compiler uses a variety of optimization techniques to improve the execution speed of your C/C++ programs and to reduce their size.

Following are some of the optimizations performed by the compiler:

Optimization                                              See
Cost-based register allocation                            Section 3.12.1
Alias disambiguation                                      Section 3.12.2
Branch optimizations and control-flow simplification      Section 3.12.3
Data flow optimizations (copy propagation, common
subexpression elimination, and redundant assignment
elimination)                                              Section 3.12.4
Expression simplification                                 Section 3.12.5
Inline expansion of functions                             Section 3.12.6
Function symbol aliasing                                  Section 3.12.7
Induction variable optimizations and strength reduction   Section 3.12.8
Loop-invariant code motion                                Section 3.12.9
Loop rotation                                             Section 3.12.10
Instruction scheduling                                    Section 3.12.11

C28x-Specific Optimization                                See
Register variables                                        Section 3.12.12
Register tracking/targeting                               Section 3.12.13
Tail merging                                              Section 3.12.14
Autoincrement addressing                                  Section 3.12.15
Removing comparisons to zero                              Section 3.12.16
RPTB generation (for FPU targets only)                    Section 3.12.17

3.12.1 Cost-Based Register Allocation

The compiler, when optimization is enabled, allocates registers to user variables and compiler temporary values according to their type, use, and frequency. Variables used within loops are weighted to have priority over others, and those variables whose uses do not overlap can be allocated to the same register.

Induction variable elimination and loop test replacement allow the compiler to recognize a simple counting loop and unroll or eliminate it. Strength reduction turns array references into efficient pointer references with autoincrements.

3.12.2 Alias Disambiguation

C and C++ programs generally use many pointer variables. Frequently, compilers are unable to determine whether two or more lvalues (symbols, pointer references, or structure references) refer to the same memory location. This aliasing of memory locations often prevents the compiler from retaining values in registers, because it cannot be sure that the register and memory continue to hold the same values over time.

Alias disambiguation is a technique that determines when two pointer expressions cannot point to the same location, allowing the compiler to freely optimize such expressions.

3.12.3 Branch Optimizations and Control-Flow Simplification

The compiler analyzes the branching behavior of a program and rearranges the linear sequences of operations (basic blocks) to remove branches or redundant conditions. Unreachable code is deleted, branches to branches are bypassed, and conditional branches over unconditional branches are simplified to a single conditional branch.

When the value of a condition is determined at compile time (through copy propagation or other data flow analysis), the compiler can delete a conditional branch. Switch case lists are analyzed in the same way as conditional branches and are sometimes eliminated entirely. Some simple control flow constructs are reduced to conditional instructions, totally eliminating the need for branches.

3.12.4 Data Flow Optimizations

Collectively, the following data flow optimizations replace expressions with less costly ones, detect and remove unnecessary assignments, and avoid operations that produce values that are already computed. The compiler with optimization enabled performs these data flow optimizations both locally (within basic blocks) and globally (across entire functions).

  • Copy propagation. Following an assignment to a variable, the compiler replaces references to the variable with its value. The value can be another variable, a constant, or a common subexpression. This can result in increased opportunities for constant folding, common subexpression elimination, or even total elimination of the variable.
  • Common subexpression elimination. When two or more expressions produce the same value, the compiler computes the value once, saves it, and reuses it.
  • Redundant assignment elimination. Often, copy propagation and common subexpression elimination optimizations result in unnecessary assignments to variables (variables with no subsequent reference before another assignment or before the end of the function). The compiler removes these dead assignments.
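As a minimal sketch (hypothetical code), the comments below show how the three optimizations interact in a single function:

int f(int a, int b)
{
    int t = a + b;        /* copy propagation: later uses of t
                             become a + b                        */
    int x = t * 2;        /* a + b is computed once ...          */
    int y = (a + b) * 3;  /* ... and reused here by common
                             subexpression elimination           */
                          /* t then has no remaining uses, so its
                             assignment is removed as redundant  */
    return x + y;
}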

3.12.5 Expression Simplification

For optimal evaluation, the compiler simplifies expressions into equivalent forms, requiring fewer instructions or registers. Operations between constants are folded into single constants. For example, a = (b + 4) - (c + 1) becomes a = b - c + 3.

3.12.6 Inline Expansion of Functions

The compiler replaces calls to small functions with inline code, saving the overhead associated with a function call as well as providing increased opportunities to apply other optimizations.

3.12.7 Function Symbol Aliasing

The compiler recognizes a function whose definition contains only a call to another function. If the two functions have the same signature (same return value and same number of parameters with the same type, in the same order), then the compiler can make the calling function an alias of the called function.

For example, consider the following:

int bbb(int arg1, char *arg2);

int aaa(int n, char *str)
{
    return bbb(n, str);
}

For this example, the compiler makes aaa an alias of bbb, so that at link time all calls to function aaa should be redirected to bbb. If the linker can successfully redirect all references to aaa, then the body of function aaa can be removed and the symbol aaa is defined at the same address as bbb.

For information about using the GCC function attribute syntax to declare function aliases, see Section 6.14.2.

3.12.8 Induction Variables and Strength Reduction

Induction variables are variables whose value within a loop is directly related to the number of executions of the loop. Array indices and control variables for loops are often induction variables.

Strength reduction is the process of replacing inefficient expressions involving induction variables with more efficient expressions. For example, code that indexes into a sequence of array elements is replaced with code that increments a pointer through the array.

Induction variable analysis and strength reduction together often remove all references to your loop-control variable, allowing its elimination.
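As a before-and-after sketch (hypothetical code), the first loop below indexes the array with an induction variable; the second shows the pointer form the compiler effectively produces through strength reduction, in which the loop counter itself can disappear:

/* Before: array indexing recomputes the address a + i each pass */
void clear(int *a, int n)
{
    int i;
    for (i = 0; i < n; i++)
        a[i] = 0;
}

/* After (conceptually): an autoincremented pointer replaces a[i] */
void clear_sr(int *a, int n)
{
    int *p = a;
    while (n-- > 0)
        *p++ = 0;
}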

3.12.9 Loop-Invariant Code Motion

This optimization identifies expressions within loops that always compute to the same value. The computation is moved in front of the loop, and each occurrence of the expression in the loop is replaced by a reference to the precomputed value.
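As a sketch (hypothetical code), the product scale * offset below never changes inside the loop, so the compiler computes it once in front of the loop and reuses the value:

void adjust(int *a, int n, int scale, int offset)
{
    int i;
    for (i = 0; i < n; i++)
        a[i] += scale * offset;   /* loop-invariant: hoisted so the
                                     multiply executes only once    */
}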

3.12.10 Loop Rotation

The compiler evaluates loop conditionals at the bottom of loops, saving an extra branch out of the loop. In many cases, the initial entry conditional check and the branch are optimized out.
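As a sketch (hypothetical code), the top-tested loop in run() is rotated so that the conditional branch falls at the bottom of the loop body; when the compiler can prove the loop runs at least once, the initial guard test is optimized out as well:

extern void work(void);

void run(int count)
{
    while (count > 0) {   /* source form: test at the top */
        work();
        count--;
    }
}

/* Rotated form (conceptually):
       if (count > 0)
           do { work(); count--; } while (count > 0);
   Each iteration now ends in a single conditional branch. */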

3.12.11 Instruction Scheduling

The compiler performs instruction scheduling, which is the rearranging of machine instructions in such a way that improves performance while maintaining the semantics of the original order. Instruction scheduling is used to improve instruction parallelism and hide latencies. It can also be used to reduce code size.

3.12.12 Register Variables

The compiler helps maximize the use of registers for storing local variables, parameters, and temporary values. Accessing variables stored in registers is more efficient than accessing variables in memory. Register variables are particularly effective for pointers.

3.12.13 Register Tracking/Targeting

The compiler tracks the contents of registers to avoid reloading values if they are used again soon. Variables, constants, and structure references such as (a.b) are tracked through straight-line code. Register targeting also computes expressions directly into specific registers when required, as in the case of assigning to register variables or returning values from functions.

3.12.14 Tail Merging

If you are optimizing for code size, tail merging can be very effective for some functions. Tail merging finds basic blocks that end in an identical sequence of instructions and have a common destination. If such a set of blocks is found, the sequence of identical instructions is made into its own block. These instructions are then removed from the set of blocks and replaced with branches to the newly created block. Thus, there is only one copy of the sequence of instructions, rather than one for each block in the set.
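As a sketch (hypothetical code), both arms of the if statement below end in the same instruction sequence; tail merging keeps a single copy of that tail and makes both paths branch to it:

extern void log_event(int code);

int classify(int v, int *out)
{
    if (v < 0) {
        log_event(1);
        *out = v * v;     /* identical tail ...                     */
        return *out + 1;
    } else {
        log_event(2);
        *out = v * v;     /* ... shared by both paths after merging */
        return *out + 1;
    }
}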

3.12.15 Autoincrement Addressing

For pointer expressions of the form *p++, the compiler uses efficient C28x autoincrement addressing modes. In many cases, where code steps through an array in a loop such as below, the loop optimizations convert the array references to indirect references through autoincremented register variable pointers.

for (i = 0; i < N; ++i)
    ... a[i] ...

3.12.16 Removing Comparisons to Zero

Because most of the 32-bit instructions and some of the 16-bit instructions can modify the status register when the result of their operation is 0, explicit comparisons to 0 may be unnecessary. The C28x C/C++ compiler removes comparisons to 0 if a previous instruction can be modified to set the status register appropriately.

3.12.17 RPTB Generation (for FPU Targets Only)

When the target has hardware floating-point support, some loops can be converted to hardware loops called repeat blocks (RPTB). Normally, a loop looks like this:

Label:
        ...loop body...
        SUB   loop_count
        CMP
        B     Label

The same loop, when converted to a RPTB loop, looks like this:

        RPTB  end_label, loop_count
        ...loop body...
end_label:

A repeat block loop is loaded into a hardware buffer and executed for the specified number of iterations. This kind of loop has minimal or zero branching overhead, and can improve performance. The loop count is stored in a special register RB (repeat block register), and the hardware seamlessly decrements the count without any explicit subtractions. Thus, there is no overhead due to the subtract, the compare, and the branch. The only overhead is due to the RPTB instruction that executes once before the loop. The RPTB instruction takes one cycle if the number of iterations is a constant, and 4 cycles otherwise. This overhead is incurred once per loop.

There are limitations on the minimum and maximum loop size for a loop to qualify for becoming a repeat block, due to the presence of the buffer. Also, the loop cannot contain any inner loops or function calls.
