This is the course schedule for SPO600 in Winter 2024. It may be adjusted according to the needs of the participants and changes in standards and technology.
Each topic will be linked to relevant notes as the course proceeds.
Note that content is being converted from the previous wiki. There may be links to content which has not yet been converted – these will be imported soon.
Optimization is the process of evaluating different ways that software can be written or built and selecting the option that has the best performance tradeoffs.
Optimization may involve substituting software algorithms, altering the sequence of operations, using architecture-specific code, or altering the build process. It is important to ensure that the optimized software produces correct results and does not cause an unacceptable performance regression for other use-cases, system configurations, operating systems, or architectures.
The definition of “performance” varies according to the target system and the operating goals. For example, in some contexts, low memory or storage usage is important; in other cases, fast operation; and in other cases, low CPU utilization or long battery life may be the most important factor. It is often possible to trade off performance in one area for another; using a lookup table, for example, can reduce CPU utilization and improve battery life in some algorithms, in return for increased memory consumption.
Most advanced compilers perform some level of optimization, and the options selected for compilation can have a significant effect on the trade-offs made by the compiler, affecting memory usage, execution speed, executable size, power consumption, and debuggability.
Benchmarking involves testing software performance under controlled conditions so that the performance can be compared to other software, the same software operating on other types of computers, or so that the impact of a change to the software can be gauged.
Profiling is the process of analyzing software performance on a finer scale, determining resource usage per program part (typically per function or method). This can identify software bottlenecks and potential targets for optimization. The resources studied may include memory, CPU cycles/time, or power.
Building software is a complex task that many developers gloss over. The simple act of compiling a program invokes a process with five or more stages, including pre-processing, compiling, optimizing, assembling, and linking. However, a complex software system will have hundreds or even thousands of source files, as well as dozens or hundreds of build configuration options, auto configuration scripts (cmake, autotools), build scripts (such as Makefiles) to coordinate the process, test suites, and more.
The build process varies significantly between software packages. Most software distribution projects (including Linux distributions such as Ubuntu and Fedora) use a packaging system that further wraps the build process in a standardized script format, so that different software packages can be built using a consistent process.
In order to get consistent and comparable benchmark results, you need to ensure that the software is being built in a consistent way. Altering the build process is one way of optimizing software.
Note that the build time for a complex package can range up to hours or even days!
Follow the instructions on the SPO600 Communication Tools page to set up a blog, create SSH keys, and send your blog URLs and public key to me.
I will use this information to:
This updating is done in batches once or twice a week – allow some time!
We discussed compiler feature flags (individual options starting with -f) and optimization levels (bundles of feature flags controlled by -O).

There is no synchronous (Zoom) class for January 26.
Now it's your turn to experiment with 6502 assembly language and have some fun. The 6502 Math and Strings Lab (Lab 2) gives you a lot of flexibility to choose an interesting mini-project and execute it.
Example code: /public/spo600-sve-sve2-ifunc-examples.tgz on aarch64-001.spo600.cdot.systems

These are the steps required to build GCC:
1. Clone the source code: git clone git://gcc.gnu.org/git/gcc.git
2. Create a build directory, separate from the source directory, and change into it.
3. Run the configure script in the source directory using a relative path (e.g., ../gcc/configure). Add a --prefix=dir option to specify where the software will be installed, plus any other configuration options you want to specify. The dir should be within your home directory, for example $HOME/gcc-test-001/
4. Run make with the -j n option to specify the maximum number of parallel jobs that should be executed at one time. The value of n should typically be in the range of (number of cores + 1) to (2 * number of cores + 1), depending on the performance characteristics of the machine on which you're building.
5. Run make install as a non-root user. Assuming you specified the prefix correctly above, the software should install into subdirectories of the prefix directory, e.g., prefix/bin, prefix/lib64, and so forth.
6. Add the installation directory to your PATH: PATH="prefix/bin:$PATH"
There is no need to run any of these steps as the root user, and it is dangerous to run the installation step as the root user, because you could overwrite the system's copy of the software you're installing. Use your regular user account instead.
To build another copy of the same gcc version, perhaps with some code or configuration changes, you can either repeat the process above with a fresh build directory (start at step 2), or you can run make clean
in your existing build directory and then repeat the process above (start at step 4). Which option you choose will depend on whether you want to keep the previous build for reference.
Tip: Each build takes a lot of disk space (12GB or more in the build directory and 2.7GB or more in the installation directory), so check your available disk space periodically (df -h .
). Delete unneeded builds regularly. If you're using the class servers and space is getting low, let your professor know and he can adjust the system's storage configuration.
To test your build:
gcc --version
– you should see the version reported as the version you cloned with git (GCC 14.xx.yy), and the build date (immediately after the version number) should match the date on which you built your copy of gcc.

The qemu-aarch64 instruction emulator will enable the execution of aarch64 code on any Linux system.
If the system is an aarch64 system, then the majority of the code will run natively on the CPU, and qemu-aarch64 will only handle instructions that are not understood by the system. Therefore, if the CPU is an ARMv8 CPU, and the software is ARMv9 software, then the majority of the instructions will run directly on the CPU and the few instructions that exist in ARMv9 that are not present in ARMv8 (such as SVE2 instructions) will be handled much more slowly by the qemu-aarch64 software. You can use this approach to (for example) run ARMv9 software on the class aarch64 server.
To use qemu-aarch64 on an aarch64 system, place the qemu-aarch64
executable in front of the name of the executable you wish to run:
qemu-aarch64 testprogram ...
However, if the system is an x86_64 system, then the CPU will not be able to execute any of the aarch64 instructions, and all of the instructions will be emulated by the qemu-aarch64 software. That means that the code will execute, but at a fraction of the speed at which it would execute on an actual aarch64 system. However, it will run!
To use qemu-aarch64 on an x86_64 system, you will need the qemu-aarch64 software as well as a full set of userspace files (binaries, libraries, and so forth). You can obtain these from the /public
directory on the class x86_64 server:
$ ll -h /public/aarch64-f38*
-rw-r--r--. 1 chris chris 2.5K Oct 13 10:44 /public/aarch64-f38-root.README
-rw-r--r--. 1 chris chris 934M Oct 13 08:33 /public/aarch64-f38-root.tar.xz
The README file contains installation instructions. The tar.xz file contains the userspace, qemu-aarch64 static binary, and a startup script. Note that the tar.xz file is almost 1 GB in size, and will expand to approximately 3.5 GB when uncompressed.
When the tar.xz file is installed on a Linux system using the instructions in the README file, you will have a full aarch64 Fedora 38 Linux system available. The start-aarch64-chroot
script in the top directory of the unpacked archive will start the qemu environment using a chroot
command. Note that this is not a virtual machine – it's a specific group of processes running under the main system.
The /proc
and /sys
filesystems are not mounted by default in the aarch64 chroot. The best way to mount these is to add these lines to the /etc/fstab
file within the chroot:
proc /proc proc defaults 0 0
sysfs /sys sysfs defaults 0 0
You may want to comment out the lines for /boot
and /boot/efi
at the same time.
Once those changes have been made to the /etc/fstab
file, you can mount the additional filesystems with the command:
mount -a
It may also be useful to set wide-open permissions on the /dev/null
device:
chmod a+rw /dev/null
Note that the chroot environment starts a root shell. You can create other users with the useradd
command, and switch from root to those users with the command su - username.
To build GCC in the aarch64 chroot, you will need to install these dependencies (use dnf):
gmp-devel mpfr-devel libmpc-devel gcc-c++
The Raspberry Pi 4 and 5 utilize aarch64 processors, but are not very fast systems. The Pi 5 is noticeably faster than the Pi 4 and is available with more RAM (8 GB).
You can use a Pi4 or a Pi5 to build software. When building code using make
, a jobs value of -j5
is probably optimal.
Using Raspberry Pi OS, you will need to install (at least) these dependencies to build GCC (use apt install
to install them):
gcc make libmpc-dev libgmp-dev libmpfr-dev
Then run configure
and make
with the usual arguments. Note that SD cards may be slow for storage - consider using an external USB3 SSD or the fastest SD card you can find.

Build time for GCC 14 is approximately 168 minutes on a Pi5 with 8GB of RAM.
The GCC test suite, distributed with the source code, is based on the DejaGNU framework.
As documented in the notes for the compiler testsuite, you must use the -k
option with make check
:
make -k check
However, in order for this to succeed, the DejaGNU software must be installed on your target system. On Fedora, you can do this with sudo dnf install dejagnu
. On Debian/Ubuntu/Raspberry Pi OS systems, use sudo apt install dejagnu
.
Note that the test suite will take hours to execute, even on a fast system!
It produces a number of files ending in .sum
which summarize the test results (it will also produce other log files - see the documentation). It's a good idea to merge the stdout and stderr of the make
command and redirect that to a log file, too, perhaps like this:
$ time make -k check |& tee make-check.log
We selected and assigned tasks during this class. The task assignments are visible on the Participant Page as well as the Project Page.
We reviewed the goals and approaches of the project.
There are multiple versions of processors of every architecture currently in the market. You can see this when you go into a computer store such as Canada Computers or Best Buy – there are laptops and desktops with processors ranging from Atoms and Celerons to Ryzen 3/5/7/9 and Core i3/i5/i7/i9 processors, and workstations and servers with processors ranging up to Xeon and Epyc/Threadripper devices. Similarly, cellphones range from devices with Cortex-A35 cores through Cortex-X3 cores.
This wide range of devices supports a diverse range of processor features.
Software developers (and vendors) are caught between supporting only the latest hardware, which limits the market they can sell to, or else harming the performance of their software by not taking advantage of recent processor improvements. Neither option is attractive for a software company wishing to be competitive.
The goal is to take good advantage of processor features with minimal effort by the software developers.
There are three solutions in various stages of preparation, each of which builds upon the previous solutions:
GCC IFUNC documentation:
Current documentation:
- target_clones syntax
- __HAVE_FEATURE_MULTI_VERSIONING (or __FEATURE_FUNCTION_MULTI_VERSIONING or __ARM_FEATURE_FUNCTION_MULTIVERSIONING) does not appear to be defined
- Implementation in GCC:
- On x86: __attribute__((target("nnn"))) - where nnn may take the form of "default", or "feature" e.g., "sse4.2", or "feature,feature" e.g., "sse4.2,avx2", or it may take the form "arch=archlevel" e.g., "arch=x86-64-v3" or "arch=atom"
- On x86: __attribute__((target_clones("nnn1", "nnn2" [...])))
- On aarch64: __attribute__((target_version("nnn"))) - where nnn may take the form of "default", or "feature" e.g., "sve", or "feature+feature" e.g., "sve+sve2". (Note: in some earlier versions of GCC, a plus sign was required at the start of the feature list, e.g., "+sve" instead of "sve". This was changed by GCC 14.) Note the use of the attribute target_version, as opposed to target (as used on x86), as specified by the ACLE. Note that the "arch=nnn" format is not supported.
- On aarch64: __attribute__((target_clones("nnn", "nnn" [...]))) - note that contrary to some of the documentation, there is no automatic "default" argument - the first argument supplied should be "default"