Table of Contents
SPO600
This is the course schedule for SPO600 in Winter 2024. It may be adjusted according to the needs of the participants and changes in standards and technology.
Each topic will be linked to relevant notes as the course proceeds.
Current Participants
Course Notes
Note that content is being converted from the previous wiki. There may be links to content which has not yet been converted – these will be imported soon.
Week 1
Week 1 - Class I
Video
General Course Information
- Course resources are linked from the CDOT wiki, starting at https://wiki.cdot.senecacollege.ca/wiki/SPO600 (Quick find: This page will usually be Google's top result for a search on “SPO600”), arranged by week and class. There will be lots of hyperlinks – be sure to follow these links.
- Coursework is submitted by blogging. The only exception to this is quizzes.
- Quizzes will be short (~1 page) and will be held without announcement at the start of any synchronous class. There is no opportunity to re-take a missed quiz, but your lowest three quiz scores will not be counted, so do not worry if you miss one or two.
- Students with test accommodations: an alternate monthly quiz can be made available via the Test Centre. Communicate with your professor for details.
- Course marks (see Weekly Schedule for dates):
- 60% - Project Deliverables in three phases (15%, 20%, 25%)
- 20% - Communication (Blog writing, in four phases roughly a month long each, 5% each)
- 20% - Labs and Quizzes (10% labs; 10% for quizzes - lowest 3 quiz scores not counted)
About SPO600 Classes
- Online Classes
- Wednesday and Friday 10:45-12:30 pm
- A summary video will be posted on a best-effort basis (technical issues may prevent posting in some cases). The summary video will be edited and may not include some discussion and questions that take place in the class. The link(s) to the video(s) will be posted on this page under the corresponding date.
- It is strongly recommended that you attend the online sessions and take notes.
- Pre-recorded Classes
- From time to time, an online class may be replaced by pre-recorded videos. The links will be provided on this page under the corresponding date.
Introduction to the Problems
Porting and Portability
- Most software is written in a high-level language which can be compiled into machine code for a specific computer architecture. In many cases, this code can be compiled or interpreted for execution on multiple computer architectures - this is called 'portable' code. However, there is a lot of existing code that contains some architecture-specific code fragments which contains assumptions about the architecture, resulting in architecture-specific high-level or Assembly Language code.
- Reasons that code is architecture-specific:
- System assumptions that don't hold true on other platforms
- Variable or word size
- Code that takes advantage of platform-specific features
- Reasons for writing code in Assembly Language include:
- Performance
- Direct access to hardware features, e.g., CPUID registers
- Most of the historical reasons for using assembler are no longer valid. Modern compilers can out-perform most hand-optimized assembly code, atomic operations can be handled by libraries or compiler intrinsics, and most hardware access should be performed through the operating system or appropriate libraries.
- A new architecture has appeared: AArch64, which is part of ARMv8. This is the first new computer architecture to appear in several years (at least, the first mainstream computer architecture).
- At this point, most key open source software (the software typically present in a Linux distribution such as Ubuntu or Fedora, for example) now runs on AArch64. However, it may not yet be as extensively optimized as on older architectures (such as x86_64).
Optimization
Optimization is the process of evaluating different ways that software can be written or built and selecting the option that has the best performance tradeoffs.
Optimization may involve substituting software algorithms, altering the sequence of operations, using architecture-specific code, or altering the build process. It is important to ensure that the optimized software produces correct results and does not cause an unacceptable performance regression for other use-cases, system configurations, operating systems, or architectures.
The definition of “performance” varies according to the target system and the operating goals. For example, in some contexts, low memory or storage usage is important; in other cases, fast operation; and in other cases, low CPU utilization or long battery life may be the most important factor. It is often possible to trade off performance in one area for another; using a lookup table, for example, can reduce CPU utilization and improve battery life in some algorithms, in return for increased memory consumption.
Most advanced compilers perform some level of optimization, and the options selected for compilation can have a significant effect on the trade-offs made by the compiler, affecting memory usage, execution speed, executable size, power consumption, and debuggability.
Benchmarking and Profiling
Benchmarking involves testing software performance under controlled conditions so that the performance can be compared to other software, the same software operating on other types of computers, or so that the impact of a change to the software can be gauged.
Profiling is the process of analyzing software performance on finer scale, determining resource usage per program part (typically per function/method). This can identify software bottlenecks and potential targets for optimization. The resource utilization studies may include memory, CPU cycles/time, or power.
Build Process
Building software is a complex task that many developers gloss over. The simple act of compiling a program invokes a process with five or more stages, including pre-processing, compiling, optimizing, assembling, and linking. However, a complex software system will have hundreds or even thousands of source files, as well as dozens or hundreds of build configuration options, auto configuration scripts (cmake, autotools), build scripts (such as Makefiles) to coordinate the process, test suites, and more.
The build process varies significantly between software packages. Most software distribution projects (including Linux distributions such as Ubuntu and Fedora) use a packaging system that further wraps the build process in a standardized script format, so that different software packages can be built using a consistent process.
In order to get consistent and comparable benchmark results, you need to ensure that the software is being built in a consistent way. Altering the build process is one way of optimizing software.
Note that the build time for a complex package can range up to hours or even days!
Course Setup
Follow the instructions on the SPO600 Communication Tools page to set up a blog, create SSH keys, and send your blog URLs and public key to me.
I will use this information to:
- Update the Current SPO600 Participants page with your information, and
- Create an account for you on the SPO600 Servers.
This updating is done in batches once or twice a week – allow some time!
Week 1 - Class II
Video
- MS Streams link: Summary Video - January 12, 2024 class
6502 Assembly
- 6502 - Basic information about the processor
Lab 1
Week 1 Deliverables
- Set up your SPO600 Communication Tools
- Perform Lab 1 and blog your results
Week 2
Week 2 - Class I
Video
Compilers: Standard Optimizations and Feature Flags
- Compiler Optimizations including a brief discussion of feature flags (
-f
) and optimization levels (bundles of feature flag controlled by-O
)
Week 2 - Class II
Video
- An edited summary video covering 6502 Math / Jumps, Branches
6502 Math and Jumps, Branches, and Procedures
Week 2 Deliverables
- Your Communication Tools should be set up by now!
- Complete your Lab 1
Week 3
Week 3 - Class I
Video
- An edited summary video covering Compiler Targets and Tuning
Compilers: Targets and Tuning
- Resources
- x86_64 psABI specification
- ARM ABI specifications
Week 3 - Class II
There is no synchronous (Zoom) class for January 26.
Video
- 6502 String Basics (26 minutes)
- 6502 String Input (72 minutes) Note: references in this video to Lab 3 in a previous semester are relevant to this semester's Lab 2, but the original code requirement has been increased from 25% to 75%.
- A 6502 Assembly Language Hack (optional) (5 minutes)
Lab
Now it's your turn to experiment with 6502 assembly language and have some fun. The 6502 Math and Strings Lab (Lab 2) gives you a lot of flexibility to chose an interesting mini-project and execute it.
Week 3 Deliverables
Week 4
Week 4 - Class I
Video
Resources
Experimentation
- Make sure you can login to both of the SPO600 Servers
- Build and test the iFunc demo code (https://github.com/ctyler/ifunc-aarch64-demo) on the AArch64 server
Week 4 - Class II
Video
Resources
- ELF file format
- ARM 64-bit CPU Instruction Set and Software Developer Manuals
- ARM Aarch64 documentation
- The short guide to the ARMv8 instruction set: ARMv8 Instruction Set Overview (“ARM ISA Overview”)
- The long guide to the ARMv8 instruction set: ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile (“ARM ARM”)
- x86_64 Documentation
- AMD Developer Guide and Manuals(see the AMD64 Architecture section, particularly the AMD64 Architecture Programmer’s Manual Volume 3: General Purpose and System Instructions)
- GAS Manual - Using as, The GNU Assembler: https://sourceware.org/binutils/docs/as/
Week 4 Deliverables
- Complete your Lab 2
- Test your ability to access both SPO600 Servers
- Blog about your test of the iFunc demo code
- Ensure that your blog is ready for marking by the end of the weekend (February 4, 11:59 pm)
Week 5
Week 5 - Class I
Video
- Edited summary video: 64-Bit Assembly, Part 2
Lab 3
- 64-Bit Assembly Language Lab (Lab 3)
- We performed steps 1-3 of the lab
- Steps 4 onward are for you to do
Week 5 - Class II
Video
Week 5 Deliverables
- Perform Lab 3 and blog your results
Week 6
Week 6 - Class I
Video
- Multiple Micro-Architectures and iFunc (30 minutes) (Note: this video was narrated via TTS)
- A video from a previous semester for background: SVE and SVE2 (85 minutes)
Code Examples
- The code used in the video is available in the directory
/public/spo600-sve-sve2-ifunc-examples.tgz
on aarch64-001.spo600.cdot.systems
Week 6 Deliverables
- Lab 4 will be released on Friday.
Week 7
Video
Building GCC
These are the steps required to build GCC:
- Obtain the source by anonymously pulling from the main branch of the git repository:
git clone git://gcc.gnu.org/git/gcc.git
- Create an empty build directory in which to build the software. This should not be inside the source tree; a good place to put it is beside the source tree.
- Change your working directory to the build directory.
- Perform this step ONLY as your regular, non-root user. Run the
configure
script in the source directory using a relative path (e.g.,../gcc/configure
). Add a--prefix=dir
option to specify where the software will be installed, plus any other configuration options you want to specify. The dir should be within your home directory, for example$HOME/gcc-test-001/
- Run
make
with the-j n
option to specify the maximum number of parallel jobs that should be executed at one time. The value of n should typically be in the range of (number of cores + 1) to (2 * number of cores + 1) depending on the performance characteristics of the machine on which you're building. - Run
make install
as a non-root user. Assuming you specified the prefix correctly above, the software should install into subdirectories of the prefix directory, e.g.,prefix/bin
,prefix/lib64
, and so forth. - Add your bin directory to your PATH:
PATH=“prefix/bin:$PATH”
There is no need to run any of these steps as the root user, and it is dangerous to run the installation step as the root user, because you could overwrite the system's copy of the software you're installing. Use your regular user account instead.
To build another copy of the same gcc version, perhaps with some code or configuratin changes, you can either repeat the process above with a fresh build directory (start at step 2), or you can run make clean
in your existing build directory and then repeat the process above (start at step 4). Which option you choose will depend on whether you want to keep the previous build for reference.
Tip: Each build takes a lot of disk space (12GB or more in the build directory and 2.7GB or more in the installation directory), so check your available disk space periodically (df -h .
). Delete unneeded builds reguarly. If you're using the class servers and space is getting low, let your professor know and he can adjust the system's storage configuration.
Testing Your Build
To test your build:
- Having altered your PATH as noted above,verify the version of GCC that you're using by running:
gcc --version
– you should see the version reported as the version you cloned with git (GCC 14.xx.yy) and the build date (immediately after the version number) should match the date on which you build your copy of gcc. - Optionally, run the compiler testsuite.
- Verify that the compiler operates correctly by using it to build your code. Ideally, test some features that should be present in the new version that won't be present in the system-installed copy of gcc. Remember that “gcc” stands for “GNU Compiler Collection” and includes not just the gcc C compiler, but also the g++ C++ compiler and compilers for other languages, plus supporting libraries and tools.
Week 8
Week 8 - Class I
Video
Project
Week 8 - Class II
Video
Week 8 Deliverables
- Blog about your Project Stage 1 work.
Week 9
Week 9 - Class I
Video
Using AArch64 Software on an x86_64 System
The qemu-aarch64 instruction emulator will enable the execution of aarch64 code on any Linux system.
If the system is an aarch64 system, then the majority of the code will run natively on the CPU, and qemu-aarch64 will only handle instructions that are not understood by the system. Therefore, if the CPU is an ARMv8 CPU, and the software is ARMv9 software, then the majority of the instructions will run directly on the CPU and the few instructions that exist in ARMv9 that are not present in ARMv8 (such as SVE2 instructions) will be handled much more slowly by the qemu-aarch64 software. You can use this approach to (for example) run ARMv9 software on the class aarch64 server.
To use qemu-aarch64 on an aarch64 system, place the qemu-aarch64
executable in front of the name of the executable you wish to run:
qemu-aarch64 testprogram ...
However, if the system is an x86_64 system, then the CPU will not be able to execute any of the aarch64 instructions, and all of the instructions will be emulated by the qemu-aarch64 software. That means that the code will execute, but at a fraction of the speed at which it would execute on an actual aarch64 system. However, it will run!
To use qemu-aarch64 on an x86_64 system, you will need the qemu-aarch64 software as well as a full set of userspace files (binaries, libraries, and so forth). You can obtain these from the /public
directory on the class class x86_64 server:
$ ll -h /public/aarch64-f38* -rw-r--r--. 1 chris chris 2.5K Oct 13 10:44 /public/aarch64-f38-root.README -rw-r--r--. 1 chris chris 934M Oct 13 08:33 /public/aarch64-f38-root.tar.xz
The README file contains installation instructions. The tar.xz file contains the userspace, qemu-aarch64 static binary, and a startup script. Note that the tar.xz file is almost 1 GB in size, and will expand to approximately 3.5 GB when uncompressed.
When the tar.xz file is installed on a Linux system using the instructions in the README file, you will have a full aarch64 Fedora 38 Linux system available. The start-aarch64-chroot
script in the top directory of the unpacked archive will start the qemu environment using a chmod
command. Note that this is not a virtual machine – it's a specific group of processes running under the main system.
The /proc
and /sys
filesystems are not mounted by default in the aarch64 chroot. The best way to mount these is to add these lines to the /etc/fstab
file within the chroot:
proc /proc proc defaults 0 0 sysfs /sys sysfs defaults 0 0
You may want to comment out the lines for /boot
and /boot/efi
at the same time.
Once those changes have been made to the /etc/fstab
file, you can mount the additional filesystems with the command:
mount -a
It may also be useful to set wide-open permissions on the /dev/null
device:
chmod a+rw /dev/null
Note that in the chroot environment starts a root shell. You can create other users with the useradd
command, and switch from root to those users with the command su - username
To build GCC in the aarch64 chroot, you will need to install these dependencies (use dnf):
gmp-devel mpfr-devel libmpc-devel libmpc-devel gcc-g++
Using a Raspberry Pi 4 or 5
The Raspberry Pi 4 and 5 utilize aarch64 processors, but are not very fast systems. The Pi 5 is noticably faster than the Pi 4 and is available with more RAM (8 GB).
You can use a Pi4 or a Pi5 to build software. When building code using make
, a jobs value of -j5
is probably optimal.
Using Raspberry Pi OS, you will need to install (at least) these dependencies to build GCC (use apt install
to install them):
gcc make libmpc-dev libgmp-dev libmpfr-dev
Build time for GCC 14 is approximately 168 minutes on a Pi5 with 8GB.
Then run configure
and make
with the usual arguments. Note that SD cards may be slow for storage - consider using an external USB3 SSD or the fastest SD card you can find.
Using Make Check on GCC
The GCC test suite, distributed with the source code, is based on the DejaGNU framework.
As documented in the notes for the compiler testsuite, you must use the -k
option with make check
:
make -k check
However, in order for this to succeed, the DejaGNU software must be installed on your target system. On Fedora, you can do this with sudo dnf install dejagnu
. On Debian/Ubuntu/Raspberry Pi OS systems, use sudo apt install dejagnu
.
Note that the test suite will take hours to execute, even on a fast system!
It produces a number of files ending in .sum
which summarize the test results (it will also producce other log files - see the documentation). It's a good idea to merge the stdout and stderr of the make
command and redirect that to a log file, too, perhaps like this:
$ time make -k check |& tee make-check.log
Week 9 - Class II
Video
- Summary Video (pending)
Week 9 Deliverables
- Project Stage 1
Week 10
Week 10 - Class I
Video
- Summary Video (pending)
Task Selection
We selected and assigned tasks during this class. The task assignments are visible on the Participant Page as well as the Project Page.
Week 10 - Class II
Video
- Summary Video (pending)
Week 10 Deliverables
- Blog about your project work
Week 11
Week 11 - Class I
Week 11 Deliverables
- Blog about your Project work
Week 12
Week 12 - Class I
Project Review
We reviewed the goals and approaches of the project.
The Problem
There are multiple versions of processors of every architecture currently in the market. You can see this when you go into a computer store such as Canada Computers or Best Buy – there are laptops and desktops with processors ranging from Atoms and Celerons to Ryzen 4/7/9 and Core i3/i5/i7/i9 processors, and workstations and servers with processors ranging up to Xeon and Epyc/Threadripper devices. Similarly, cellphones range from devices with Cortex-A35 cores through Neoverse X3 cores.
These wide range of devices support a diverse range of processor features.
Software developers (and vendors) are caught between supporting only the latest hardware, which limits the market they can sell to, or else harming the performance of their software by not taking advantage of recent processor improvements. Neither option is attractive for a software company wishing to be competitive.
The Goal
To take good advantage of processor features with minimal effort by the software developers.
Three Solutions
There are three solutions in various stages of preparation, each of which builds upon the previous solutions:
- IFUNC - Indirect Functions - This is a solution provided by the development toolchain (compiler, linker, libraries) but which is largely manual for the software developer. The developer provides multiple alternate versions of performance-critical functions which are targeted at different micro-architectural levels, plus a resolver function that selects between the implementations at runtime. Note that IFUNC is the only solution which enables a resolver function that takes into account factors other than the micro-architectural level of the processor. For example, a resolver function could select beween alternate functions based on available memory, storage performance, or the speed of the network connection.
- FMV - Function Multi-Versioning - This is a solution that is also supported by the development toolchain but which involves slightly less manual work for the developer. There are two levels of FMV:
- FMV with Manual Alternate Functions - The programmer provides the alternate functions and uses function attributes to specify the microarchitectural level at which each is targeted. The resolver function for each group of alternate functions is automatically generated.
- FMV with Cloned Functions - The program provides one version of the function and uses function attributes to specify that clones of that function are to be built, and the micro-architectural targets for each clone. The resolver function for each group of cloned functions is automatically generated. The only difference between the cloned functions is the micro-architectural optimizations that are applied by the compiler. Note that there is nothing to ensure that the clones are actually any better or in fact different from each other.
- AFMV - Automatic Function Multi-Versioning - This is what we're working on - This is effectively FMV with Cloned Functions, but the cloning is controlled from the command line rather than using function attributes. This has the advantage that no source changes are required. Every function in the program is cloned, and the after the various optimization passes have been applied, the cloned functions are analyzed. If the functions are different, they are kept, but if they are idential, they are removed, and only the default version of the function is used.
Specifics: IFUNC
GCC IFUNC documenation:
Specifics: FMV
Current documentation:
- Mentions that FMV is only implemented for i386 (aka x86_32) - now false as it's also implemented for x86_64, aarch64, and Power
- Does not mention
target_clones
syntax
- Does not talk about the current state of implementation
- Mentions that FMV may be disabled at compile time by a compiler flag, but this flag is not documented
- The macro
__HAVE_FEATURE_MULTI_VERSIONING
(or__FEATURE_FUNCTION_MULTI_VERSIONING
or__ARM_FEATURE_FUNCTION_MULTIVERSIONING
) does not appear to be defined
Implementation in GCC
- Implemented and tested in (at least) x86_64, PowerPC4, and AArch64
- I did not test the PowerPC version
- Testing performed with GCC 14.0.1 20240223
- On x86:
- Syntax to manually specify a function target:
__attribue__((target("nnn")))
- wherennn
may take the form of “default”, or “feature” eg., “sse4.2”, or “feature,feature” e.g., “sse4.2,avx2”, or it may take the form “arch=archlevel” e.g., “arch=x86-64-v3” or “arch=atom” - target_version is not accepted as an alternative to target attribute
- Syntax to manually specify cloning:
__attribute__((target_clone("nnn1", "nnn2" [...])))
- Works in both the C and C++ frontends
- On AArch64:
- Current support landed Dec 16, 2023
- Syntax to manually specify a function target:
__attribute__((target_version("nnn")))
- wherennn
may take the form of “default”, or “feature” e.g., “sve”, or “feature+feature” e.g., “sve+sve2” (Note: in some earlier versions of GCC, a plus-sign was required at the start of the feature list, e.g., “+sve” instead of “sve”. This was changed by gcc 14). Note the use of the attribute target_version as opposed to target (as used on x86) which according to the ACLE . Note that the “arch=nnn” format is not supported. - Syntax to manually specify cloning:
__attribute__((target_clone("nnn", "nnn" [...])))
- note that contrary to some of the documentation, there is no automatic “default” argument - the first argument supplied should be “default” - Manually specified function target works in the C++ frontend only, but automatic cloning appears to work in both C and C++. Note that most C code can be compiled with the C++ frontend, except for some recent C enhancements not understood by C++ as well as some C++ keywords that are not reserved in C
Week 11 Deliverables
- Submit your Project Stage 2