SPO600

This is the course schedule for SPO600 in Winter 2024. It may be adjusted according to the needs of the participants and changes in standards and technology.

Each topic will be linked to relevant notes as the course proceeds.

Current Participants

See SPO600 2024 Winter Participants

Course Notes

Note that content is being converted from the previous wiki. There may be links to content which has not yet been converted – these will be imported soon.

Week 1

Week 1 - Class I

Video

General Course Information

About SPO600 Classes

Introduction to the Problems

Porting and Portability
Optimization

Optimization is the process of evaluating different ways that software can be written or built and selecting the option that has the best performance tradeoffs.

Optimization may involve substituting software algorithms, altering the sequence of operations, using architecture-specific code, or altering the build process. It is important to ensure that the optimized software produces correct results and does not cause an unacceptable performance regression for other use-cases, system configurations, operating systems, or architectures.

The definition of “performance” varies according to the target system and the operating goals. For example, in some contexts, low memory or storage usage is important; in other cases, fast operation; and in other cases, low CPU utilization or long battery life may be the most important factor. It is often possible to trade off performance in one area for another; using a lookup table, for example, can reduce CPU utilization and improve battery life in some algorithms, in return for increased memory consumption.
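
As a minimal sketch of the lookup-table trade-off described above (the names here are illustrative, not from any particular codebase), precomputing the number of set bits in every possible byte value spends 256 bytes of memory to avoid a per-call loop:

#include <stdint.h>

/* 256 bytes of memory traded for CPU time: each entry holds the
   number of set bits in that byte value. */
static uint8_t popcount_table[256];

void init_popcount_table(void)
{
    for (int i = 0; i < 256; i++) {
        int bits = 0;
        for (int v = i; v != 0; v >>= 1)
            bits += v & 1;
        popcount_table[i] = bits;
    }
}

/* A single table read replaces a loop over eight bits. */
int popcount8(uint8_t x)
{
    return popcount_table[x];
}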

Most advanced compilers perform some level of optimization, and the options selected for compilation can have a significant effect on the trade-offs made by the compiler, affecting memory usage, execution speed, executable size, power consumption, and debuggability.
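
For example, GCC's optimization-level options select different trade-off points (these are standard GCC flags; other compilers offer similar controls):

gcc -O0 -g -o program program.c    # no optimization; easiest to debug
gcc -O2 -o program program.c       # general-purpose optimization
gcc -O3 -o program program.c       # aggressive optimization for speed
gcc -Os -o program program.c       # optimize for executable size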

Benchmarking and Profiling

Benchmarking involves testing software performance under controlled conditions so that it can be compared to the performance of other software, of the same software operating on other types of computers, or of the same software before and after a change, so that the impact of the change can be gauged.

Profiling is the process of analyzing software performance on a finer scale, determining resource usage per program part (typically per function or method). This can identify software bottlenecks and potential targets for optimization. The resource utilization studies may include memory, CPU cycles/time, or power.
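
One common way to profile on Linux is GCC's -pg instrumentation combined with gprof, sketched below; tools such as perf take a similar measure-then-report approach:

gcc -pg -o program program.c    # build with profiling instrumentation
./program                       # run normally; writes gmon.out
gprof program gmon.out          # report time spent per function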

Build Process

Building software is a complex task that many developers gloss over. The simple act of compiling a program invokes a process with five or more stages, including pre-processing, compiling, optimizing, assembling, and linking. However, a complex software system will have hundreds or even thousands of source files, as well as dozens or hundreds of build configuration options, auto configuration scripts (cmake, autotools), build scripts (such as Makefiles) to coordinate the process, test suites, and more.
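
You can observe these stages individually with GCC by stopping the process at each step (standard GCC options; the file names are examples):

gcc -E program.c -o program.i    # pre-process only
gcc -S program.i -o program.s    # compile to assembly (optimization happens here)
gcc -c program.s -o program.o    # assemble into an object file
gcc program.o -o program         # link into an executable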

The build process varies significantly between software packages. Most software distribution projects (including Linux distributions such as Ubuntu and Fedora) use a packaging system that further wraps the build process in a standardized script format, so that different software packages can be built using a consistent process.

In order to get consistent and comparable benchmark results, you need to ensure that the software is being built in a consistent way. Altering the build process is one way of optimizing software.

Note that the build time for a complex package can range up to hours or even days!

Course Setup

Follow the instructions on the SPO600 Communication Tools page to set up a blog, create SSH keys, and send your blog URLs and public key to me.

I will use this information to:

  1. Update the Current SPO600 Participants page with your information, and
  2. Create an account for you on the SPO600 Servers.

This updating is done in batches once or twice a week – allow some time!

Week 1 - Class II

Video

6502 Assembly

Lab 1

Week 1 Deliverables

  1. Perform Lab 1 and blog your results

Week 2

Week 2 - Class I

Video

Compilers: Standard Optimizations and Feature Flags

Week 2 - Class II

Video

6502 Math and Jumps, Branches, and Procedures

Week 2 Deliverables

Week 3

Week 3 - Class I

Video

Compilers: Targets and Tuning

Week 3 - Class II

There is no synchronous (Zoom) class for January 26.

Video

Lab

Now it's your turn to experiment with 6502 assembly language and have some fun. The 6502 Math and Strings Lab (Lab 2) gives you a lot of flexibility to choose an interesting mini-project and execute it.

Week 3 Deliverables

Week 4

Week 4 - Class I

Video

Resources

Experimentation

Week 4 - Class II

Video

Resources

Week 4 Deliverables

Week 5

Week 5 - Class I

Video

Lab 3

Week 5 - Class II

Video

Week 5 Deliverables

Week 6

Week 6 - Class I

Video

Code Examples

Week 6 Deliverables

Week 7

Video

Building GCC

These are the steps required to build GCC (a condensed command sequence follows the list):

  1. Obtain the source by anonymously pulling from the main branch of the git repository: git clone git://gcc.gnu.org/git/gcc.git
  2. Create an empty build directory in which to build the software. This should not be inside the source tree; a good place to put it is beside the source tree.
  3. Change your working directory to the build directory.
  4. Perform this step ONLY as your regular, non-root user. Run the configure script in the source directory using a relative path (e.g., ../gcc/configure). Add a --prefix=dir option to specify where the software will be installed, plus any other configuration options you want to specify. The dir should be within your home directory, for example $HOME/gcc-test-001/
  5. Run make with the -j n option to specify the maximum number of parallel jobs that should be executed at one time. The value of n should typically be in the range of (number of cores + 1) to (2 * number of cores + 1) depending on the performance characteristics of the machine on which you're building.
  6. Run make install as a non-root user. Assuming you specified the prefix correctly above, the software should install into subdirectories of the prefix directory, e.g., prefix/bin, prefix/lib64, and so forth.
  7. Add your bin directory to your PATH: PATH="prefix/bin:$PATH"
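
Condensed into a single command sequence, the steps above look roughly like this (the directory names and -j value are examples; adjust them for your machine):

git clone git://gcc.gnu.org/git/gcc.git
mkdir gcc-build-001                 # beside the source tree, not inside it
cd gcc-build-001
../gcc/configure --prefix=$HOME/gcc-test-001
make -j 24                          # e.g., for a 16-core machine
make install
PATH="$HOME/gcc-test-001/bin:$PATH"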

There is no need to run any of these steps as the root user, and it is dangerous to run the installation step as the root user, because you could overwrite the system's copy of the software you're installing. Use your regular user account instead.

To build another copy of the same gcc version, perhaps with some code or configuration changes, you can either repeat the process above with a fresh build directory (start at step 2), or you can run make clean in your existing build directory and then repeat the process above (start at step 4). Which option you choose will depend on whether you want to keep the previous build for reference.

Tip: Each build takes a lot of disk space (12 GB or more in the build directory and 2.7 GB or more in the installation directory), so check your available disk space periodically (df -h .). Delete unneeded builds regularly. If you're using the class servers and space is getting low, let your professor know and he can adjust the system's storage configuration.

Testing Your Build

To test your build:
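
A quick sanity check (a minimal sketch, assuming the prefix layout described in the Week 7 notes) is to confirm that the newly built compiler is the one on your PATH and that it can build and run a trivial program:

which gcc                        # should report prefix/bin/gcc
gcc --version                    # should report the version you built
echo 'int main(void){return 0;}' > test.c
gcc test.c -o test && ./test && echo OK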

Week 8

Week 8 - Class I

Video

Project

Week 8 - Class II

Video

Week 8 Deliverables

Week 9

Week 9 - Class I

Video

Using AArch64 Software on an x86_64 System

The qemu-aarch64 instruction emulator will enable the execution of aarch64 code on any Linux system.

If the system is an aarch64 system, then the majority of the code will run natively on the CPU, and qemu-aarch64 will only handle instructions that are not understood by that CPU. For example, if the CPU is an ARMv8 CPU and the software is ARMv9 software, then the majority of the instructions will run directly on the CPU, and the few ARMv9 instructions that are not present in ARMv8 (such as SVE2 instructions) will be handled much more slowly by the qemu-aarch64 software. You can use this approach to (for example) run ARMv9 software on the class aarch64 server.

To use qemu-aarch64 on an aarch64 system, place the qemu-aarch64 executable in front of the name of the executable you wish to run:

qemu-aarch64 testprogram ...

However, if the system is an x86_64 system, then the CPU will not be able to execute any of the aarch64 instructions, and all of the instructions will be emulated by the qemu-aarch64 software. That means that the code will execute, but at a fraction of the speed at which it would execute on an actual aarch64 system. However, it will run!

To use qemu-aarch64 on an x86_64 system, you will need the qemu-aarch64 software as well as a full set of userspace files (binaries, libraries, and so forth). You can obtain these from the /public directory on the class x86_64 server:

$ ll -h /public/aarch64-f38*
-rw-r--r--. 1 chris chris 2.5K Oct 13 10:44 /public/aarch64-f38-root.README
-rw-r--r--. 1 chris chris 934M Oct 13 08:33 /public/aarch64-f38-root.tar.xz

The README file contains installation instructions. The tar.xz file contains the userspace, qemu-aarch64 static binary, and a startup script. Note that the tar.xz file is almost 1 GB in size, and will expand to approximately 3.5 GB when uncompressed.

When the tar.xz file is installed on a Linux system using the instructions in the README file, you will have a full aarch64 Fedora 38 Linux system available. The start-aarch64-chroot script in the top directory of the unpacked archive will start the qemu environment using a chroot command. Note that this is not a virtual machine – it's a specific group of processes running under the main system.

The /proc and /sys filesystems are not mounted by default in the aarch64 chroot. The best way to mount these is to add these lines to the /etc/fstab file within the chroot:

proc	/proc  proc    defaults 0 0
sysfs /sys   sysfs   defaults 0 0

You may want to comment out the lines for /boot and /boot/efi at the same time.

Once those changes have been made to the /etc/fstab file, you can mount the additional filesystems with the command:

mount -a

It may also be useful to set wide-open permissions on the /dev/null device:

chmod a+rw /dev/null

Note that the chroot environment starts a root shell. You can create other users with the useradd command, and switch from root to those users with the command su - username

To build GCC in the aarch64 chroot, you will need to install these dependencies (use dnf):

gmp-devel
mpfr-devel
libmpc-devel
gcc-g++

Using a Raspberry Pi 4 or 5

The Raspberry Pi 4 and 5 utilize aarch64 processors, but are not very fast systems. The Pi 5 is noticeably faster than the Pi 4 and is available with more RAM (8 GB).

You can use a Pi 4 or a Pi 5 to build software. When building code using make, a jobs value of -j5 is probably optimal (both models have four cores, matching the cores-plus-one guideline above).

Using Raspberry Pi OS, you will need to install (at least) these dependencies to build GCC (use apt install to install them):

gcc
make
libmpc-dev
libgmp-dev
libmpfr-dev

Then run configure and make with the usual arguments. Note that SD cards may be slow for storage - consider using an external USB3 SSD or the fastest SD card you can find.

Build time for GCC 14 is approximately 168 minutes on a Pi 5 with 8 GB of RAM.

Using Make Check on GCC

The GCC test suite, distributed with the source code, is based on the DejaGNU framework.

As documented in the notes for the compiler testsuite, you must use the -k option with make check:

make -k check

However, in order for this to succeed, the DejaGNU software must be installed on your target system. On Fedora, you can do this with sudo dnf install dejagnu. On Debian/Ubuntu/Raspberry Pi OS systems, use sudo apt install dejagnu.

Note that the test suite will take hours to execute, even on a fast system!

It produces a number of files ending in .sum which summarize the test results (it will also produce other log files - see the documentation). It's a good idea to merge the stdout and stderr of the make command and redirect that to a log file, too, perhaps like this:

$ time make -k check |& tee make-check.log
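
To compare the results of two test runs, the GCC source tree includes a helper script, contrib/compare_tests, which reports the regressions between two sets of results (the .sum paths shown are examples; they appear under each build tree):

../gcc/contrib/compare_tests previous-build/gcc/testsuite/gcc/gcc.sum current-build/gcc/testsuite/gcc/gcc.sum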

Week 9 - Class II

Video

Week 9 Deliverables

Week 10

Week 10 - Class I

Video

Task Selection

We selected and assigned tasks during this class. The task assignments are visible on the Participant Page as well as the Project Page.

Week 10 - Class II

Video

Week 10 Deliverables

Week 11

Week 11 - Class I

Week 11 Deliverables

Week 12

Week 12 - Class I

Project Review

We reviewed the goals and approaches of the project.

The Problem

There are multiple versions of processors of every architecture currently on the market. You can see this when you go into a computer store such as Canada Computers or Best Buy – there are laptops and desktops with processors ranging from Atoms and Celerons to Ryzen 3/5/7/9 and Core i3/i5/i7/i9 processors, and workstations and servers with processors ranging up to Xeon and Epyc/Threadripper devices. Similarly, cellphones range from devices with Cortex-A35 cores through Cortex-X3 cores.

This wide range of devices supports a diverse range of processor features.

Software developers (and vendors) are caught between supporting only the latest hardware, which limits the market they can sell to, and forgoing recent processor improvements, which harms the performance of their software. Neither option is attractive for a software company wishing to be competitive.

The Goal

To take good advantage of processor features with minimal effort by the software developers.

Three Solutions

There are three solutions in various stages of preparation, each of which builds upon the previous solutions:

  1. IFUNC - Indirect Functions - This is a solution provided by the development toolchain (compiler, linker, libraries) but which is largely manual for the software developer. The developer provides multiple alternate versions of performance-critical functions, targeted at different micro-architectural levels, plus a resolver function that selects between the implementations at runtime. Note that IFUNC is the only solution which enables a resolver function that takes into account factors other than the micro-architectural level of the processor. For example, a resolver function could select between alternate functions based on available memory, storage performance, or the speed of the network connection.
  2. FMV - Function Multi-Versioning - This is a solution that is also supported by the development toolchain but which involves slightly less manual work for the developer. There are two levels of FMV:
    1. FMV with Manual Alternate Functions - The programmer provides the alternate functions and uses function attributes to specify the microarchitectural level at which each is targeted. The resolver function for each group of alternate functions is automatically generated.
    2. FMV with Cloned Functions - The programmer provides one version of the function and uses function attributes to specify that clones of that function are to be built, and the micro-architectural targets for each clone. The resolver function for each group of cloned functions is automatically generated. The only difference between the cloned functions is the micro-architectural optimizations that are applied by the compiler. Note that there is nothing to ensure that the clones are actually any better or in fact different from each other.
  3. AFMV - Automatic Function Multi-Versioning - This is what we're working on. It is effectively FMV with Cloned Functions, but the cloning is controlled from the command line rather than through function attributes. This has the advantage that no source changes are required. Every function in the program is cloned, and after the various optimization passes have been applied, the cloned functions are analyzed. If the functions are different, they are kept; if they are identical, they are removed, and only the default version of the function is used.

Specifics: IFUNC

GCC IFUNC documentation:
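
As a minimal sketch of the IFUNC mechanism (illustrative names; see the GCC documentation for the full details), the resolver runs once at program load and returns a pointer to the implementation to use:

#include <stdio.h>

/* Two alternate implementations of the same operation. */
static void process_generic(void) { puts("generic version"); }
static void process_fast(void)    { puts("optimized version"); }

/* The resolver runs once, at load time. It could test CPU features,
   available memory, or anything else; this placeholder always picks
   the generic version. */
static void (*resolve_process(void))(void)
{
    int have_fast_cpu = 0;    /* illustrative placeholder test */
    return have_fast_cpu ? process_fast : process_generic;
}

/* The public function is bound to whichever implementation the
   resolver returns (requires an ELF platform such as Linux). */
void process(void) __attribute__((ifunc("resolve_process")));

int main(void)
{
    process();
    return 0;
}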

Specifics: FMV

Current documentation:

1. GCC documentation

2. ARM ACLE documentation

Implementation in GCC
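
As a sketch of FMV with Cloned Functions using GCC's target_clones attribute (the x86_64 feature names here are examples; the ARM ACLE document above lists the aarch64 equivalents):

#include <stdio.h>

/* GCC emits one clone of this function per listed target, plus an
   automatically generated resolver that selects a clone at load time. */
__attribute__((target_clones("default", "avx2")))
int sum(const int *a, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += a[i];
    return total;
}

int main(void)
{
    int data[] = {1, 2, 3, 4};
    printf("%d\n", sum(data, 4));
    return 0;
}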

Week 12 Deliverables