ARM Optimizing C/C++ Compiler v16.12.0.STS User's Guide
SPNU151N - REVISED DECEMBER 2016

Optimizing Your Code

The compiler tools can perform many optimizations to improve the execution speed and reduce the size of C and C++ programs by simplifying loops, rearranging statements and expressions, and allocating variables into registers.

This chapter describes how to invoke different levels of optimization and describes which optimizations are performed at each level. This chapter also describes how you can use the Interlist feature when performing optimization and how you can profile or debug optimized code.

Invoking Optimization

The C/C++ compiler is able to perform various optimizations. High-level optimizations are performed in the optimizer and low-level, target-specific optimizations occur in the code generator. Use high-level optimization levels, such as --opt_level=2 and --opt_level=3, to achieve optimal code.

The easiest way to invoke optimization is to specify the --opt_level=n option on the compiler command line; -On is an alias for this option. The n denotes the level of optimization (0, 1, 2, 3, and 4), which controls the type and degree of optimization.

  • --opt_level=off or -Ooff
    • Performs no optimization
  • --opt_level=0 or -O0
    • Performs control-flow-graph simplification
    • Allocates variables to registers
    • Performs loop rotation
    • Eliminates unused code
    • Simplifies expressions and statements
    • Expands calls to functions declared inline
  • --opt_level=1 or -O1
    Performs all --opt_level=0 (-O0) optimizations, plus:
    • Performs local copy/constant propagation
    • Removes unused assignments
    • Eliminates local common expressions
  • --opt_level=2 or -O2
    Performs all --opt_level=1 (-O1) optimizations, plus:
    • Performs loop optimizations
    • Eliminates global common subexpressions
    • Eliminates global unused assignments
    • Performs loop unrolling
  • --opt_level=3 or -O3
    Performs all --opt_level=2 (-O2) optimizations, plus:
    • Removes all functions that are never called
    • Simplifies functions with return values that are never used
    • Inlines calls to small functions
    • Reorders function declarations so that the attributes of called functions are known when the caller is optimized
    • Propagates arguments into function bodies when all calls pass the same value in the same argument position
    • Identifies file-level variable characteristics
    If you use --opt_level=3 (-O3), see Section 3.2 and Section 3.3 for more information.
  • --opt_level=4 or -O4
    Performs link-time optimization. See Section 3.4 for details.

By default, debugging is enabled and the default optimization level is unaffected by the generation of debug information. However, the optimization level used is affected by whether or not the command line includes the -g (--symdebug:dwarf) option and the --opt_level option as shown in the following table:

Table 3-1 Interaction Between Debugging and Optimization Options

Optimization            no -g                    -g
no --opt_level          --opt_level=3            --opt_level=off
--opt_level             --opt_level=3            --opt_level=3
--opt_level=n           optimized as specified   optimized as specified

The levels of optimizations described above are performed by the stand-alone optimization pass. The code generator performs several additional optimizations, particularly processor-specific optimizations. It does so regardless of whether you invoke the optimizer. These optimizations are always enabled, although they are more effective when the optimizer is used.

Performing File-Level Optimization (--opt_level=3 option)

The --opt_level=3 option (aliased as the -O3 option) instructs the compiler to perform file-level optimization. This is the default optimization level. You can use the --opt_level=3 option alone to perform general file-level optimization, or you can combine it with other options to perform more specific optimizations. The options listed in Table 3-2 work with --opt_level=3 to perform the indicated optimization:

Table 3-2 Options That You Can Use With --opt_level=3

If You ... Use this Option See
Want to create an optimization information file --gen_opt_info=n Section 3.2.1
Want to compile multiple source files --program_level_compile Section 3.3

Creating an Optimization Information File (--gen_opt_info Option)

When you invoke the compiler with the --opt_level=3 option (the default), you can use the --gen_opt_info option to create an optimization information file that you can read. The number following the option denotes the level (0, 1, or 2). The resulting file has an .nfo extension. Use Table 3-3 to select the appropriate level to append to the option.

Table 3-3 Selecting a Level for the --gen_opt_info Option

If you... Use this option
Do not want to produce an information file, but you used the --gen_opt_info=1 or --gen_opt_info=2 option in a command file or an environment variable. The --gen_opt_info=0 option restores the default behavior of the optimizer. --gen_opt_info=0
Want to produce an optimization information file --gen_opt_info=1
Want to produce a verbose optimization information file --gen_opt_info=2

Program-Level Optimization (--program_level_compile and --opt_level=3 options)

You can specify program-level optimization by using the --program_level_compile option with the --opt_level=3 option (aliased as -O3). (If you use --opt_level=4 (-O4), the --program_level_compile option cannot be used, because link-time optimization provides the same optimization opportunities as program level optimization.)

With program-level optimization, all of your source files are compiled into one intermediate file called a module. The module moves to the optimization and code generation passes of the compiler. Because the compiler can see the entire program, it performs several optimizations that are rarely applied during file-level optimization:

  • If a particular argument in a function always has the same value, the compiler replaces the argument with the value and passes the value instead of the argument.
  • If a return value of a function is never used, the compiler deletes the return code in the function.
  • If a function is not called directly or indirectly by main(), the compiler removes the function.

The --program_level_compile option requires use of --opt_level=3 or higher in order to perform these optimizations.

To see which program-level optimizations the compiler is applying, use the --gen_opt_info=2 option to generate an information file. See Section 3.2.1 for more information.

In Code Composer Studio, when the --program_level_compile option is used, C and C++ files that have the same options are compiled together. However, if any file has a file-specific option that is not selected as a project-wide option, that file is compiled separately. For example, if every C and C++ file in your project has a different set of file-specific options, each is compiled separately, even though program-level optimization has been specified. To compile all C and C++ files together, make sure the files do not have file-specific options. Be aware that if any file previously used a file-specific option, compiling all C and C++ files together may not be safe.

NOTE

Compiling Files With the --program_level_compile and --keep_asm Options

If you compile all files with the --program_level_compile and --keep_asm options, the compiler produces only one .asm file, not one for each corresponding source file.

Controlling Program-Level Optimization (--call_assumptions Option)

You can control program-level optimization, which you invoke with --program_level_compile --opt_level=3, by using the --call_assumptions option. Specifically, the --call_assumptions option indicates whether functions in other modules can call a module's external functions or modify a module's external variables. The number following --call_assumptions sets the level of these assumptions for the module being compiled. The --opt_level=3 option combines this information with its own file-level analysis to decide whether to treat this module's external function and variable declarations as if they had been declared static. Use Table 3-4 to select the appropriate level to append to the --call_assumptions option.

Table 3-4 Selecting a Level for the --call_assumptions Option

If Your Module … Use this Option
Has functions that are called from other modules and global variables that are modified in other modules --call_assumptions=0
Does not have functions that are called by other modules but has global variables that are modified in other modules --call_assumptions=1
Does not have functions that are called by other modules or global variables that are modified in other modules --call_assumptions=2
Has functions that are called from other modules but does not have global variables that are modified in other modules --call_assumptions=3

In certain circumstances, the compiler reverts to a different --call_assumptions level from the one you specified, or it might disable program-level optimization altogether. Table 3-5 lists the combinations of --call_assumptions levels and conditions that cause the compiler to revert to other --call_assumptions levels.

Table 3-5 Special Considerations When Using the --call_assumptions Option

If --call_assumptions is... Under these Conditions... Then the --call_assumptions Level...
Not specified The --opt_level=3 optimization level was specified Defaults to --call_assumptions=2
Not specified The compiler sees calls to outside functions under the --opt_level=3 optimization level Reverts to --call_assumptions=0
Not specified main() is not defined Reverts to --call_assumptions=0
--call_assumptions=1 or --call_assumptions=2 No function has main defined as an entry point, no interrupt functions are defined, and no functions are identified by the FUNC_EXT_CALLED pragma Reverts to --call_assumptions=0
--call_assumptions=1 or --call_assumptions=2 A main function is defined, or an interrupt function is defined, or a function is identified by the FUNC_EXT_CALLED pragma Remains --call_assumptions=1 or --call_assumptions=2
--call_assumptions=3 Any condition Remains --call_assumptions=3

In some situations when you use --program_level_compile and --opt_level=3, you must use a --call_assumptions option or the FUNC_EXT_CALLED pragma. See Section 3.3.2 for information about these situations.

Optimization Considerations When Mixing C/C++ and Assembly

If your program includes any assembly functions, exercise caution when using the --program_level_compile option. The compiler recognizes only the C/C++ source code, not any assembly code that might be present. Because the compiler does not recognize calls from assembly code to C/C++ functions, or modifications of C/C++ variables from assembly code, the --program_level_compile option may optimize out those C/C++ functions and variables. To keep these functions, place the FUNC_EXT_CALLED pragma (see Section 5.10.13) before any declaration or reference to a function that you want to keep.

Another approach you can take when you use assembly functions in your program is to use the --call_assumptions=n option with the --program_level_compile and --opt_level=3 options. See Section 3.3.1 for information about the --call_assumptions=n option.

In general, you achieve the best results through judicious use of the FUNC_EXT_CALLED pragma in combination with --program_level_compile --opt_level=3 and --call_assumptions=1 or --call_assumptions=2.

If any of the following situations apply to your application, use the suggested solution:

    Situation

    Your application consists of C/C++ source code that calls assembly functions. Those assembly functions do not call any C/C++ functions or modify any C/C++ variables.

    Solution

    Compile with --program_level_compile --opt_level=3 --call_assumptions=2 to tell the compiler that outside functions do not call C/C++ functions or modify C/C++ variables.

    If you compile with the --program_level_compile --opt_level=3 options only, the compiler reverts from the default (--call_assumptions=2) to --call_assumptions=0, because it presumes that the assembly language functions that have a definition in C/C++ may call other C/C++ functions or modify C/C++ variables.

    Situation

    Your application consists of C/C++ source code that calls assembly functions. The assembly language functions do not call C/C++ functions, but they modify C/C++ variables.

    Solution

    Try both of these solutions and choose the one that works best with your code:

    • Compile with --program_level_compile --opt_level=3 --call_assumptions=1.
    • Add the volatile keyword to those variables that may be modified by the assembly functions and compile with --program_level_compile --opt_level=3 --call_assumptions=2.
    Situation

    Your application consists of C/C++ source code and assembly source code. The assembly functions are interrupt service routines that call C/C++ functions; the C/C++ functions that the assembly functions call are never called from C/C++. These C/C++ functions act like main: they function as entry points into C/C++.

    Solution

    Add the volatile keyword to the C/C++ variables that may be modified by the interrupts. Then, you can optimize your code in one of these ways:

    • You achieve the best optimization by applying the FUNC_EXT_CALLED pragma to all of the entry-point functions called from the assembly language interrupts, and then compiling with --program_level_compile --opt_level=3 --call_assumptions=2. Be sure that you use the pragma with all of the entry-point functions. If you do not, the compiler might remove the entry-point functions that are not preceded by the FUNC_EXT_CALLED pragma.
    • Compile with --program_level_compile --opt_level=3 --call_assumptions=3. Because you do not use the FUNC_EXT_CALLED pragma, you must use the --call_assumptions=3 option, which is less aggressive than the --call_assumptions=2 option, and your optimization may not be as effective.

    Keep in mind that if you use --program_level_compile --opt_level=3 without additional options, the compiler removes the C functions that the assembly functions call. Use the FUNC_EXT_CALLED pragma to keep these functions.

Link-Time Optimization (--opt_level=4 Option)

Link-time optimization is an optimization mode that gives the compiler visibility of the entire program. The optimization occurs at link time instead of at compile time, as it does for the other optimization levels.

Link-time optimization is invoked by using the --opt_level=4 option. This option must be used in both the compilation and linking steps. At compile time, the compiler embeds an intermediate representation of the file being compiled into the resulting object file. At link-time this representation is extracted from every object file which contains it, and is used to optimize the entire program.

If you use --opt_level=4 (-O4), the --program_level_compile option cannot also be used, because link-time optimization provides the same optimization opportunities as program level optimization (Section 3.3). Link-time optimization provides the following benefits:

  • Each source file can be compiled separately. One issue with program-level compilation is that it requires all source files to be passed to the compiler at one time. This often requires significant modification of a customer's build process. With link-time optimization, all files can be compiled separately.
  • References to C/C++ symbols from assembly are handled automatically. When doing program-level compilation, the compiler has no knowledge of whether a symbol is referenced externally. When performing link-time optimization during a final link, the linker can determine which symbols are referenced externally and prevent eliminating them during optimization.
  • Third party object files can participate in optimization. If a third party vendor provides object files that were compiled with the --opt_level=4 option, those files participate in optimization along with user-generated files. This includes object files supplied as part of the TI run-time support. Object files that were not compiled with --opt_level=4 can still be used in a link that is performing link-time optimization, but they do not participate in the optimization.
  • Source files can be compiled with different option sets. With program-level compilation, all source files must be compiled with the same option set. With link-time optimization files can be compiled with different options. If the compiler determines that two options are incompatible, it issues an error.

Option Handling

When performing link-time optimization, source files can be compiled with different options. When possible, the options used to compile a file are also used during link-time optimization of that file. For options that apply at the program level (for instance, --auto_inline), the options used to compile the main function are used. If main is not included in link-time optimization, the option set used for the first object file specified on the command line is used. Some options, such as --opt_for_speed, affect a wide range of optimizations; for these, the program-level behavior is derived from main, and the local optimizations use the original option set.

Some options are incompatible when performing link-time optimization. These are usually options which conflict on the command line as well, but can also be options that cannot be handled during link-time optimization.

Incompatible Types

During a normal link, the linker does not check that each symbol was declared with the same type in every file; such checking is not necessary during a normal link. When performing link-time optimization, however, the linker must ensure that all symbols are declared with compatible types in the different source files. If a symbol is found whose declared types are incompatible, an error is issued. The rules for compatible types are derived from the C and C++ standards.

Using Feedback Directed Optimization

Feedback directed optimization provides a method for finding frequently executed paths in an application using compiler-based instrumentation. This information is fed back to the compiler and is used to perform optimizations. This information is also used to provide you with information about application behavior.

Feedback Directed Optimization

Feedback directed optimization uses run-time feedback to identify and optimize frequently executed program paths. Feedback directed optimization is a two-phase process.

Phase 1 -- Collect Program Profile Information

In this phase the compiler is invoked with the option --gen_profile_info, which instructs the compiler to add instrumentation code to collect profile information. The compiler inserts a minimal amount of instrumentation code to determine control flow frequencies. Memory is allocated to store counter information.

The instrumented application program is executed on the target using representative input data sets. The input data sets should correlate closely with the way the program is expected to be used in the end product environment. When the program completes, a run-time-support function writes the collected information into a profile data file called a PDAT file. Multiple executions of the program using different input data sets can be performed and in such cases, the run-time-support function appends the collected information into the PDAT file. The resulting PDAT file is post-processed using a tool called the Profile Data Decoder or armpdd. The armpdd tool consolidates multiple data sets and formats the data into a feedback file (PRF file, see Section 3.5.2) for consumption by phase 2 of feedback directed optimization.

Phase 2 -- Use Application Profile Information for Optimization

In this phase, the compiler is invoked with the --use_profile_info=file.prf option, which reads the specified PRF file generated in phase 1. In phase 2, optimization decisions are made using the data generated during phase 1. The profile feedback file is used to guide program optimization. The compiler optimizes frequently executed program paths more aggressively.

The compiler uses data in the profile feedback file to guide certain optimizations of frequently executed program paths.

Generating and Using Profile Information

There are two options that control feedback directed optimization:

--gen_profile_info tells the compiler to add instrumentation code to collect profile information. When the program executes the run-time-support exit() function, the profile data is written to a PDAT file. This option applies to all the C/C++ source files being compiled on the command-line.

If the TI_PROFDATA host environment variable is set, the profile data is written to the file it specifies; the full pathname of the PDAT file (including the directory name) can be given. Otherwise, the default filename pprofout.pdat is used.

By default, the RTS profile data output routine uses the C I/O mechanism to write data to the PDAT file. You can install a device handler for the PPHNDL device to re-direct the profile data to a custom device driver routine. For example, this could be used to send the profile data to a device that does not use a file system.

Feedback directed optimization requires you to turn on at least some debug information when using the --gen_profile_info option. This enables the compiler to output debug information that allows armpdd to correlate compiled functions and their associated profile data.

--use_profile_info specifies the profile information file(s) to use for performing phase 2 of feedback directed optimization. More than one profile information file can be specified on the command line; the compiler uses all input data from multiple information files. The syntax for the option is:

--use_profile_info=file1[,file2,...,filen]

If no filename is specified, the compiler looks for a file named pprofout.prf in the directory where the compiler is invoked.

Example Use of Feedback Directed Optimization

These steps illustrate the creation and use of feedback directed optimization.

  1. Generate profile information:

     armcl --opt_level=2 --gen_profile_info foo.c --run_linker --output_file=foo.out --library=lnk.cmd --library=rtsv4_A_be_eabi.lib

  2. Execute the application. The execution creates a PDAT file named pprofout.pdat in the current (host) directory. The application can be run on target hardware connected to a host machine.

  3. Process the profile data. After running the application with multiple data sets, run armpdd on the PDAT files to create a profile information (PRF) file to be used with --use_profile_info:

     armpdd -e foo.out -o pprofout.prf pprofout.pdat

  4. Re-compile using the profile feedback file:

     armcl --opt_level=2 --use_profile_info=pprofout.prf foo.c --run_linker --output_file=foo.out --library=lnk.cmd --library=rtsv4_A_be_eabi.lib

The .ppdata Section

The profile information collected in phase 1 is stored in the .ppdata section, which must be allocated into target memory. The .ppdata section contains profiler counters for all functions compiled with --gen_profile_info. The default lnk.cmd file has directives to place the .ppdata section in data memory. If the link command file has no section directive for allocating the .ppdata section, the link step places it in a writable memory range.

The .ppdata section must be allocated memory in multiples of 32 bytes. Please refer to the linker command file in the distribution for example usage.
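A hedged sketch of such a directive, using the linker command file SECTIONS syntax (RAM is a placeholder for a writable memory range in your device's memory map):

```
SECTIONS
{
    .ppdata : {} > RAM   /* profiler counters; reserve space in
                            multiples of 32 bytes */
}
```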

Feedback Directed Optimization and Code Size Tune

Feedback directed optimization is different from the Code Size Tune feature in Code Composer Studio (CCS). The Code Size Tune feature uses CCS profiling to select specific compilation options for each function in order to minimize code size while still maintaining a specific performance point. Code Size Tune is coarse-grained, since it selects an option set for the whole function. Feedback directed optimization, by contrast, selects different optimization goals for specific regions within a function.

Instrumented Program Execution Overhead

During profile collection, the execution time of the application may increase. The amount of increase depends on the size of the application and the number of files in the application compiled for profiling.

The profiling counters increase the code and data size of the application. Consider using the --opt_for_space (-ms) code size option when profiling to mitigate the code size increase. This has no effect on the accuracy of the profile data being collected: because profiling counts only execution frequency, not cycle counts, code size optimization flags do not affect profiler measurements.

Invalid Profile Data

When recompiling with --use_profile_info, the profile information is invalid in the following cases:

  • The source file name changed between the generation of profile information (gen-profile) and the use of the profile information (use-profile).
  • The source code was modified since gen-profile. In this case, profile information is invalid for the modified functions.
  • Certain compiler options used with gen-profile differ from those used with use-profile. In particular, options that affect parser behavior could invalidate profile data during use-profile. In general, using different optimization options during use-profile should not affect the validity of the profile data.

Profile Data Decoder

The code generation tools include a tool called the Profile Data Decoder or armpdd, which is used for post processing profile data (PDAT) files. The armpdd tool generates a profile feedback (PRF) file. See Section 3.5.1 for a discussion of where armpdd fits in the profiling flow. The armpdd tool is invoked with this syntax:

armpdd [-a] -eexec.out -oapplication.prf filename.pdat

-a Computes the average of the data values in the data sets instead of accumulating them.
-eexec.out Specifies exec.out as the name of the application executable.
-oapplication.prf Specifies application.prf as the formatted profile feedback file that is used as the argument to --use_profile_info during recompilation. If no output file is specified, the default output filename is pprofout.prf.
filename.pdat Is the name of the profile data file generated by the run-time-support function. The default name is pprofout.pdat; it can be overridden using the TI_PROFDATA host environment variable.

The run-time-support function and armpdd append to their respective output files and do not overwrite them. This enables collection of data sets from multiple runs of the application.

NOTE

Profile Data Decoder Requirements

Your application must be compiled with at least DWARF debug support to enable feedback directed optimization. The armpdd tool relies on basic debug information about each function to generate the formatted .prf file.

The pprofout.pdat file generated by the run-time support is a raw data file of a fixed format understood only by armpdd. You should not modify this file in any way.

Feedback Directed Optimization API

There are two user interfaces to the profiler mechanism. You can start and stop profiling in your application by using the following run-time-support calls.

  • _TI_start_pprof_collection()
    This interface informs the run-time support that you wish to start profiling collection from this point on and causes the run-time support to clear all profiling counters in the application (that is, discard old counter values).

  • _TI_stop_pprof_collection()
    This interface directs the run-time support to stop profiling collection and output profiling data into the output file (into the default file or one specified by the TI_PROFDATA host environment variable). The run-time support also disables any further output of profile data into the output file during exit(), unless you call _TI_start_pprof_collection() again.

Feedback Directed Optimization Summary

Options

--gen_profile_info Adds instrumentation to the compiled code. Execution of the code results in profile data being emitted to a PDAT file.
--use_profile_info=file.prf Uses profile information for optimization and/or generating code coverage information.
--analyze=codecov Generates a code coverage information file and continues with profile-based compilation. Must be used with --use_profile_info.
--analyze_only Generates only a code coverage information file. Must be used with --use_profile_info. You must specify both --analyze=codecov and --analyze_only to do code coverage analysis of the instrumented application.

Host Environment Variables

TI_PROFDATA Writes profile data into the specified file
TI_COVDIR Creates code coverage files in the specified directory
TI_COVDATA Writes code coverage data into the specified file

API

_TI_start_pprof_collection() Clears the profile counters and starts profile data collection
_TI_stop_pprof_collection() Writes out all profile counters to the output file
PPHNDL Device driver handle for low-level C I/O based driver for writing out profile data from a target program.

Files Created

*.pdat Profile data file, which is created by executing an instrumented program and used as input to the profile data decoder
*.prf Profiling feedback file, which is created by the profile data decoder and used as input to the re-compilation step

Using Profile Information to Analyze Code Coverage

You can use the analysis information from the Profile Data Decoder to analyze code coverage.

Code Coverage

The information collected during feedback directed optimization can be used for generating code coverage reports. As with feedback directed optimization, the program must be compiled with the --gen_profile_info option.

Code coverage conveys the execution count of each line of source code in the file being compiled, using data collected during profiling.

Phase1 -- Collect Program Profile Information

In this phase the compiler is invoked with the option --gen_profile_info, which instructs the compiler to add instrumentation code to collect profile information. The compiler inserts a minimal amount of instrumentation code to determine control flow frequencies. Memory is allocated to store counter information.

The instrumented application program is executed on the target using representative input data sets. The input data sets should correlate closely with the way the program is expected to be used in the end product environment. When the program completes, a run-time-support function writes the collected information into a profile data file called a PDAT file. Multiple executions of the program using different input data sets can be performed and in such cases, the run-time-support function appends the collected information into the PDAT file. The resulting PDAT file is post-processed using a tool called the Profile Data Decoder or armpdd. The armpdd tool consolidates multiple data sets and formats the data into a feedback file (PRF file, see Section 3.5.2) for consumption by phase 2 of feedback directed optimization.

Phase 2 -- Generate Code Coverage Reports

In this phase, the compiler is invoked with the --use_profile_info=file.prf option, which indicates that the compiler should read the specified PRF file generated in phase 1. The application must also be compiled with either the --analyze=codecov or the --analyze_only option so that the compiler generates a code-coverage information file. The --analyze=codecov option directs the compiler to continue compilation after generating code-coverage information, while the --analyze_only option stops the compiler after generating code-coverage data. For example:

armcl --opt_level=2 --use_profile_info=pprofout.prf --analyze_only foo.c

You can specify two environment variables to control the destination of the code-coverage information file.

  • The TI_COVDIR environment variable specifies the directory where the code-coverage file should be generated. The default is the directory where the compiler is invoked.
  • The TI_COVDATA environment variable specifies the name of the code-coverage data file generated by the compiler. The default is filename.csv, where filename is the base name of the file being compiled. For example, if foo.c is being compiled, the default code-coverage data file name is foo.csv.

If the code-coverage data file already exists, the compiler appends the new dataset at the end of the file.

Code-coverage data is a comma-separated list of data items that can be conveniently handled by data-processing tools and scripting languages. The following is the format of code-coverage data:

"filename-with-full-path","funcname",line#,column#,exec-frequency,"comments"

"filename-with-full-path" Full pathname of the file corresponding to the entry
"funcname" Name of the function
line# Line number of the source line corresponding to frequency data
column# Column number of the source line
exec-frequency Execution frequency of the line
"comments" Intermediate-level representation of the source-code generated by the parser


The full filename, function name, and comments appear within quotation marks ("). For example:

"/some_dir/zlib/arm/deflate.c","_deflateInit2_",216,5,1,"( strm->zalloc )"

Other tools, such as a spreadsheet program, can be used to format and view the code coverage data.

Related Features and Capabilities

The code generation tools provide some features and capabilities that can be used in conjunction with code coverage analysis. The following is a summary:

Path Profiler

The code generation tools include a path profiling utility, armpprof, that is run by the compiler, armcl. The compiler invokes the armpprof utility when the --gen_profile or --use_profile option is used on the compiler command line:

armcl --gen_profile ... file.c
armcl --use_profile ... file.c

For further information about profile-based optimization and a more detailed description of the profiling infrastructure, see Section 3.5.

Analysis Options

The path profiling utility, armpprof, appends code coverage information to existing CSV (comma separated values) files that contain the same type of analysis information.

The utility checks to make sure that an existing CSV file contains analysis information that is consistent with the type of analysis information it is being asked to generate. Attempts to mix code coverage and other analysis information in the same output CSV file will be detected, and armpprof will emit a fatal error and abort.

--analyze=codecov Instructs the compiler to generate code coverage analysis information. This option replaces the previous --codecov option.
--analyze_only Halts compilation after generation of analysis information is completed.

Environment Variables

To assist with the management of output CSV analysis files, armpprof supports this environment variable:

TI_ANALYSIS_DIR Specifies the directory in which the output analysis file will be generated.

Accessing Aliased Variables in Optimized Code

Aliasing occurs when a single object can be accessed in more than one way, such as when two pointers point to the same object or when a pointer points to a named object. Aliasing can disrupt optimization because any indirect reference can refer to another object. The optimizer analyzes the code to determine where aliasing can and cannot occur, then optimizes as much as possible while still preserving the correctness of the program. The optimizer behaves conservatively. If there is a chance that two pointers are pointing to the same object, then the optimizer assumes that the pointers do point to the same object.

The compiler assumes that if the address of a local variable is passed to a function, the function may change the local variable by writing through the pointer. The compiler also assumes that the called function does not make the local variable's address available for use elsewhere after returning; for example, the called function cannot assign the local variable's address to a global variable or return it. In cases where this assumption is invalid, use the --aliased_variables compiler option to force the compiler to assume worst-case aliasing. In worst-case aliasing, any indirect reference can refer to such a variable.

Use Caution With asm Statements in Optimized Code

You must be extremely careful when using asm (inline assembly) statements in optimized code. The compiler rearranges code segments, uses registers freely, and can completely remove variables or expressions. Although the compiler never optimizes out an asm statement (except when it is unreachable), the surrounding environment where the assembly code is inserted can differ significantly from the original C/C++ source code.

It is usually safe to use asm statements to manipulate hardware controls such as interrupt masks, but asm statements that attempt to interface with the C/C++ environment or access C/C++ variables can have unexpected results. After compilation, check the assembly output to make sure your asm statements are correct and maintain the integrity of the program.

Automatic Inline Expansion (--auto_inline Option)

When optimizing with the --opt_level=3 option (aliased as -O3), the compiler automatically inlines small functions. A command-line option, --auto_inline=size, specifies the size threshold for automatic inlining. This option controls only the inlining of functions that are not explicitly declared as inline.

When the --auto_inline option is not used, the compiler sets the size limit based on the optimization level and the optimization goal (performance versus code size). If the --auto_inline size parameter is set to 0, automatic inline expansion is disabled. If the --auto_inline size parameter is set to a nonzero integer, the compiler automatically inlines any function smaller than size. (This is a change from previous releases, which inlined functions for which the product of the function size and the number of calls to it was less than size. The new scheme is simpler, but usually leads to more inlining for a given value of size.)

The compiler measures the size of a function in arbitrary units; however, the optimizer information file (created with the --gen_opt_info=1 or --gen_opt_info=2 option) reports the size of each function in the same units that the --auto_inline option uses. When --auto_inline is used, the compiler does not attempt to prevent inlining that causes excessive growth in compile time or size; use it with care.

When the --auto_inline option is not used, the decision to inline a function at a particular call site is based on an algorithm that weighs the estimated benefit against the cost. The compiler inlines eligible functions at call sites until a limit on size or compilation time is reached.

When deciding what to inline, the compiler collects all eligible call-sites in the module being compiled and sorts them by the estimated benefit over cost. Functions declared static inline are ordered first, then leaf functions, then all others eligible. Functions that are too big are not included.

Inlining behavior varies, depending on which compile-time options are specified:

  • The code size limit is smaller when compiling for code size rather than performance. The --auto_inline option overrides this size limit.
  • At --opt_level=3, the compiler automatically inlines small functions.
  • At --opt_level=4, the compiler auto-inlines aggressively if compiling for performance.

NOTE

Some Functions Cannot Be Inlined

For a call-site to be considered for inlining, it must be legal to inline the function and inlining must not be disabled in some way. See the inlining restrictions in Section 2.11.3.

NOTE

Optimization Level 3 and Inlining

In order to turn on automatic inlining, you must use the --opt_level=3 option. If you desire the --opt_level=3 optimizations, but not automatic inlining, use --auto_inline=0 with the --opt_level=3 option.

NOTE

Inlining and Code Size

Expanding functions inline increases code size, especially inlining a function that is called in a number of places. Function inlining is optimal for functions that are called only from a small number of places and for small functions. To prevent increases in code size because of inlining, use the --auto_inline=0 option. This option causes the compiler to inline intrinsics only.

Using the Interlist Feature With Optimization

You control the output of the interlist feature when compiling with optimization (the --opt_level=n or -On option) with the --optimizer_interlist and --c_src_interlist options.

  • The --optimizer_interlist option interlists compiler comments with assembly source statements.
  • The --c_src_interlist and --optimizer_interlist options together interlist the compiler comments and the original C/C++ source with the assembly code.

When you use the --optimizer_interlist option with optimization, the interlist feature does not run as a separate pass. Instead, the compiler inserts comments into the code, indicating how the compiler has rearranged and optimized the code. These comments appear in the assembly language file as comments starting with ;**. The C/C++ source code is not interlisted, unless you use the --c_src_interlist option also.

The interlist feature can affect optimized code because it might prevent some optimization from crossing C/C++ statement boundaries. Optimization makes normal source interlisting impractical, because the compiler extensively rearranges your program. Therefore, when you use the --optimizer_interlist option, the compiler writes reconstructed C/C++ statements.

Example 3-1 shows a function that has been compiled with optimization (--opt_level=2) and the --optimizer_interlist option. The assembly file contains compiler comments interlisted with assembly code.

NOTE

Impact on Performance and Code Size

The --c_src_interlist option can have a negative effect on performance and code size.

When you use the --c_src_interlist and --optimizer_interlist options with optimization, the compiler inserts its comments and the interlist feature runs before the assembler, merging the original C/C++ source into the assembly file.

Example 3-2 shows the function from Example 3-1 compiled with the optimization (--opt_level=2) and the --c_src_interlist and --optimizer_interlist options. The assembly file contains compiler comments and C source interlisted with assembly code.

Example 3-1 The Function From Example 2-1 Compiled With the -O2 and --optimizer_interlist Options

_main:    STMFD   SP!, {LR}
;** 5  -----------------------    printf("Hello, world\n");
          ADR     A1, SL1
          BL      _printf
;** 6  -----------------------    return 0;
          MOV     A1, #0
          LDMFD   SP!, {PC}

Example 3-2 The Function From Example 2-1 Compiled with the --opt_level=2, --optimizer_interlist, and --c_src_interlist Options

_main:    STMFD   SP!, {LR}
;** 5  -----------------------    printf("Hello, world\n");
;------------------------------------------------------------------------------
;   5 | printf("Hello, world\n");
;------------------------------------------------------------------------------
          ADR     A1, SL1
          BL      _printf
;** 6  -----------------------    return 0;
;------------------------------------------------------------------------------
;   6 | return 0;
;------------------------------------------------------------------------------
          MOV     A1, #0
          LDMFD   SP!, {PC}

Debugging and Profiling Optimized Code

Generating symbolic debugging information no longer affects the ability to optimize code. The same executable code is generated regardless of whether generation of debug information is turned on or off. For this reason, debug information is now generated by default. You do not need to specify the -g option in order to debug your application.

If you do not specify the -g option and allow the default generation of debug information to be used, the default level of optimization is used unless you specify some other optimization level.

The --symdebug:dwarf option no longer disables optimization, because generation of debug information no longer impacts optimization.

If you specify the -g option explicitly but do not specify an optimization level, no optimization is performed. This is because while generating debug information does not affect the ability to optimize code, optimizing code does make it more difficult to debug code. At higher levels of optimization, the compiler's extensive rearrangement of code and the many-to-many allocation of variables to registers often make it difficult to correlate source code with object code for debugging purposes. It is recommended that you perform debugging using the lowest level of optimization possible.

If you specify an --opt_level (aliased as -O) option, that optimization level is used no matter what type of debugging information you enabled.

The optimization level used if you do not specify the level on the command line is affected by whether or not the command line includes the -g option and the --opt_level option as shown in the following table:

Table 3-6 Interaction Between Debugging and Optimization Options

Optimization          no -g                     -g
no --opt_level        --opt_level=3             --opt_level=off
--opt_level           --opt_level=3             --opt_level=3
--opt_level=n         optimized as specified    optimized as specified

Debug information increases the size of object files, but it does not affect the size of code or data on the target. If object file size is a concern and debugging is not needed, use --symdebug:none to disable the generation of debug information.

Profiling Optimized Code

To profile optimized code, use optimization (--opt_level=0 through --opt_level=3).

Controlling Code Size Versus Speed

The latest mechanism for controlling the goal of optimizations in the compiler is represented by the --opt_for_speed=num option. The num denotes the level of optimization (0-5), which controls the type and degree of code size or code speed optimization:

  • --opt_for_speed=0
  • Enables optimizations geared towards improving the code size with a high risk of worsening or impacting performance.

  • --opt_for_speed=1
  • Enables optimizations geared towards improving the code size with a medium risk of worsening or impacting performance.

  • --opt_for_speed=2
  • Enables optimizations geared towards improving the code size with a low risk of worsening or impacting performance.

  • --opt_for_speed=3
  • Enables optimizations geared towards improving the code performance/speed with a low risk of worsening or impacting code size.

  • --opt_for_speed=4
  • Enables optimizations geared towards improving the code performance/speed with a medium risk of worsening or impacting code size.

  • --opt_for_speed=5
  • Enables optimizations geared towards improving the code performance/speed with a high risk of worsening or impacting code size.

If you specify the --opt_for_speed option without a parameter, the default setting is --opt_for_speed=4. If you do not specify the --opt_for_speed option, the default setting is --opt_for_speed=1.

The best performance for caching devices has been observed with --opt_for_speed set to level 1 or 2.

What Kind of Optimization Is Being Performed?

The ARM C/C++ compiler uses a variety of optimization techniques to improve the execution speed of your C/C++ programs and to reduce their size.

Following are some of the optimizations performed by the compiler:

Optimization See
Cost-based register allocation Section 3.13.1
Alias disambiguation Section 3.13.2
Branch optimizations and control-flow simplification Section 3.13.3
Data flow optimizations Section 3.13.4
  • Copy propagation
  • Common subexpression elimination
  • Redundant assignment elimination
Expression simplification Section 3.13.5
Inline expansion of functions Section 3.13.6
Function Symbol Aliasing Section 3.13.7
Induction variable optimizations and strength reduction Section 3.13.8
Loop-invariant code motion Section 3.13.9
Loop rotation Section 3.13.10
Instruction scheduling Section 3.13.11
ARM-Specific Optimization See
Tail merging Section 3.13.12
Autoincrement addressing Section 3.13.13
Block conditionalizing Section 3.13.14
Epilog inlining Section 3.13.15
Removing comparisons to zero Section 3.13.16
Integer division with constant divisor Section 3.13.17
Branch chaining Section 3.13.18

Cost-Based Register Allocation

The compiler, when optimization is enabled, allocates registers to user variables and compiler temporary values according to their type, use, and frequency. Variables used within loops are weighted to have priority over others, and those variables whose uses do not overlap can be allocated to the same register.

Induction variable elimination and loop test replacement allow the compiler to recognize the loop as a simple counting loop and unroll or eliminate the loop. Strength reduction turns the array references into efficient pointer references with autoincrements.

Alias Disambiguation

C and C++ programs generally use many pointer variables. Frequently, compilers are unable to determine whether or not two or more lvalues (lowercase "L": symbols, pointer references, or structure references) refer to the same memory location. This aliasing of memory locations often prevents the compiler from retaining values in registers because it cannot be sure that the register and memory continue to hold the same values over time.

Alias disambiguation is a technique that determines when two pointer expressions cannot point to the same location, allowing the compiler to freely optimize such expressions.

Branch Optimizations and Control-Flow Simplification

The compiler analyzes the branching behavior of a program and rearranges the linear sequences of operations (basic blocks) to remove branches or redundant conditions. Unreachable code is deleted, branches to branches are bypassed, and conditional branches over unconditional branches are simplified to a single conditional branch.

When the value of a condition is determined at compile time (through copy propagation or other data flow analysis), the compiler can delete a conditional branch. Switch case lists are analyzed in the same way as conditional branches and are sometimes eliminated entirely. Some simple control flow constructs are reduced to conditional instructions, totally eliminating the need for branches.

Data Flow Optimizations

Collectively, the following data flow optimizations replace expressions with less costly ones, detect and remove unnecessary assignments, and avoid operations that produce values that are already computed. The compiler with optimization enabled performs these data flow optimizations both locally (within basic blocks) and globally (across entire functions).

  • Copy propagation. Following an assignment to a variable, the compiler replaces references to the variable with its value. The value can be another variable, a constant, or a common subexpression. This can result in increased opportunities for constant folding, common subexpression elimination, or even total elimination of the variable.
  • Common subexpression elimination. When two or more expressions produce the same value, the compiler computes the value once, saves it, and reuses it.
  • Redundant assignment elimination. Often, copy propagation and common subexpression elimination optimizations result in unnecessary assignments to variables (variables with no subsequent reference before another assignment or before the end of the function). The compiler removes these dead assignments.

Expression Simplification

For optimal evaluation, the compiler simplifies expressions into equivalent forms, requiring fewer instructions or registers. Operations between constants are folded into single constants. For example, a = (b + 4) - (c + 1) becomes a = b - c + 3.

Inline Expansion of Functions

The compiler replaces calls to small functions with inline code, saving the overhead associated with a function call as well as providing increased opportunities to apply other optimizations.

Function Symbol Aliasing

The compiler recognizes a function whose definition contains only a call to another function. If the two functions have the same signature (same return value and same number of parameters with the same type, in the same order), then the compiler can make the calling function an alias of the called function.

For example, consider the following:

int bbb(int arg1, char *arg2);

int aaa(int n, char *str)
{
    return bbb(n, str);
}

For this example, the compiler makes aaa an alias of bbb, so that at link time all calls to function aaa should be redirected to bbb. If the linker can successfully redirect all references to aaa, then the body of function aaa can be removed and the symbol aaa is defined at the same address as bbb.

For information about using the GCC function attribute syntax to declare function aliases, see Section 5.16.2.

Induction Variables and Strength Reduction

Induction variables are variables whose value within a loop is directly related to the number of executions of the loop. Array indices and control variables for loops are often induction variables.

Strength reduction is the process of replacing inefficient expressions involving induction variables with more efficient expressions. For example, code that indexes into a sequence of array elements is replaced with code that increments a pointer through the array.

Induction variable analysis and strength reduction together often remove all references to your loop-control variable, allowing its elimination.

Loop-Invariant Code Motion

This optimization identifies expressions within loops that always compute to the same value. The computation is moved in front of the loop, and each occurrence of the expression in the loop is replaced by a reference to the precomputed value.

Loop Rotation

The compiler evaluates loop conditionals at the bottom of loops, saving an extra branch out of the loop. In many cases, the initial entry conditional check and the branch are optimized out.

Instruction Scheduling

The compiler performs instruction scheduling, which is the rearranging of machine instructions in such a way that improves performance while maintaining the semantics of the original order. Instruction scheduling is used to improve instruction parallelism and hide latencies. It can also be used to reduce code size.

Tail Merging

If you are optimizing for code size, tail merging can be very effective for some functions. Tail merging finds basic blocks that end in an identical sequence of instructions and have a common destination. If such a set of blocks is found, the sequence of identical instructions is made into its own block. These instructions are then removed from the set of blocks and replaced with branches to the newly created block. Thus, there is only one copy of the sequence of instructions, rather than one for each block in the set.

Autoincrement Addressing

For pointer expressions of the form *p++, the compiler uses efficient ARM autoincrement addressing modes. In many cases, where code steps through an array in a loop such as below, the loop optimizations convert the array references to indirect references through autoincremented register variable pointers.

for (i = 0; i < N; ++i)
    ... a[i] ...

Block Conditionalizing

Because all 32-bit instructions can be conditional, branches can be removed by conditionalizing instructions.

In Example 3-3, the branch around the add and the branch around the subtract are removed by simply conditionalizing the add and the subtract.

Example 3-3 Block Conditionalizing C Source

int main(int a)
{
    if (a < 0)
        a = a - 3;
    else
        a = a * 3;
    return ++a;
}

Example 3-4 C/C++ Compiler Output for Example 3-3

;*********************************************************
;* FUNCTION DEF: _main                                   *
;*********************************************************
_main:    CMP     A1, #0
          ADDPL   A1, A1, A1, LSL #1
          SUBMI   A1, A1, #3
          ADD     A1, A1, #1
          BX      LR

Epilog Inlining

If the epilog of a function is a single instruction, that instruction replaces all branches to the epilog. This increases execution speed by removing the branch.

Removing Comparisons to Zero

Because most of the 32-bit instructions and some of the 16-bit instructions can modify the status register when the result of their operation is 0, explicit comparisons to 0 may be unnecessary. The ARM C/C++ compiler removes comparisons to 0 if a previous instruction can be modified to set the status register appropriately.

Integer Division With Constant Divisor

The optimizer attempts to rewrite integer divide operations with constant divisors. The integer divides are rewritten as a multiply by the reciprocal of the divisor. This occurs at optimization level 2 (--opt_level=2 or -O2) and higher. You must also compile with the --opt_for_speed option, which selects compilation for speed.

Branch Chaining

Branching to branches that jump to the desired target is called branch chaining. Branch chaining is supported in 16-bit (Thumb) mode only. Consider this code sequence:

LAB1:   BR    L10
        ....
LAB2:   BR    L10
        ....
L10:

If L10 is far away from LAB1 (large offset), the assembler converts BR into a sequence of branch around and unconditional branches, resulting in a sequence of two instructions that are either four or six bytes long. Instead, if the branch at LAB1 can jump to LAB2, and LAB2 is close enough that BR can be replaced by a single short branch instruction, the resulting code is smaller as the BR in LAB1 would be converted into one instruction that is two bytes long. LAB2 can in turn jump to another branch if L10 is too far away from LAB2. Thus, branch chaining can be extended to arbitrary depths.

When you compile in Thumb mode (--code_state=16) and for code size (--opt_for_speed is not used), the compiler generates two pseudo instructions:

  • BTcc instead of BRcc. The format is BTcc target, #[depth].
  • The #depth is an optional argument. If depth is not specified, it is set to the default branch chaining depth. If specified, the chaining depth for this branch instruction is set to #depth. The assembler issues a warning if #depth is less than zero and sets the branch chaining depth for this instruction to zero.

  • BQcc instead of Bcc. The format is BQcc target, #[depth].
  • The #depth argument is the same as for the BTcc pseudo instruction.

The BT pseudo instruction replaces the BR (pseudo branch) instruction. Similarly, BQ replaces B. The assembler performs branch chain optimizations for these instructions, if branch chaining is enabled. The assembler replaces the BT and BQ jump targets with the offset to the branch to which these instructions jump.

The default branch chaining depth is 10. This limit is designed to prevent longer branch chains from impeding performance.

You can use the BT and BQ instructions in assembly language programs to enable the assembler to perform branch chaining. You can control the branch chaining depth for each instruction by specifying the optional #depth argument. Use the BR and B instructions to prevent branch chaining for particular branches.



Copyright © 2016, Texas Instruments Incorporated. An IMPORTANT NOTICE for this document addresses availability, warranty, changes, use in safety-critical applications, intellectual property matters and other important disclaimers.