SPO600

This is the course schedule for SPO600 in Winter 2024. It may be adjusted according to the needs of the participants and changes in standards and technology.

Each topic will be linked to relevant notes as the course proceeds.

Current Participants

See SPO600 2024 Winter Participants

Course Notes

Note that content is being converted from the previous wiki. There may be links to content which has not yet been converted – these will be imported soon.

Week 1

Week 1 - Class I

Video

General Course Information

About SPO600 Classes

Introduction to the Problems

Porting and Portability
Optimization

Optimization is the process of evaluating different ways that software can be written or built and selecting the option that has the best performance tradeoffs.

Optimization may involve substituting software algorithms, altering the sequence of operations, using architecture-specific code, or altering the build process. It is important to ensure that the optimized software produces correct results and does not cause an unacceptable performance regression for other use-cases, system configurations, operating systems, or architectures.

The definition of “performance” varies according to the target system and the operating goals. For example, in some contexts, low memory or storage usage is important; in other cases, fast operation; and in other cases, low CPU utilization or long battery life may be the most important factor. It is often possible to trade off performance in one area for another; using a lookup table, for example, can reduce CPU utilization and improve battery life in some algorithms, in return for increased memory consumption.
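
As a minimal sketch of the lookup-table trade-off described above (the names here are illustrative, not from any particular codebase), precomputing the number of set bits in every possible byte value spends 256 bytes of memory to avoid a per-call loop:

#include <stdint.h>

/* 256 bytes of memory traded for CPU time: each entry holds the
   number of set bits in that byte value. */
static uint8_t popcount_table[256];

void init_popcount_table(void)
{
    for (int i = 0; i < 256; i++) {
        int bits = 0;
        for (int v = i; v != 0; v >>= 1)
            bits += v & 1;
        popcount_table[i] = bits;
    }
}

/* A single table read replaces a loop over eight bits. */
int popcount8(uint8_t x)
{
    return popcount_table[x];
}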

Most advanced compilers perform some level of optimization, and the options selected for compilation can have a significant effect on the trade-offs made by the compiler, affecting memory usage, execution speed, executable size, power consumption, and debuggability.
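
For example, GCC's optimization-level options select different trade-off points (these are standard GCC flags; other compilers offer similar controls):

gcc -O0 -g -o program program.c    # no optimization; easiest to debug
gcc -O2 -o program program.c       # general-purpose optimization
gcc -O3 -o program program.c       # aggressive optimization for speed
gcc -Os -o program program.c       # optimize for executable size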

Benchmarking and Profiling

Benchmarking involves testing software performance under controlled conditions so that it can be compared to the performance of other software, of the same software operating on other types of computers, or of the same software before and after a change, so that the impact of the change can be gauged.

Profiling is the process of analyzing software performance on a finer scale, determining resource usage per program part (typically per function or method). This can identify software bottlenecks and potential targets for optimization. The resource utilization studies may include memory, CPU cycles/time, or power.
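
One common way to profile on Linux is GCC's -pg instrumentation combined with gprof, sketched below; tools such as perf take a similar measure-then-report approach:

gcc -pg -o program program.c    # build with profiling instrumentation
./program                       # run normally; writes gmon.out
gprof program gmon.out          # report time spent per function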

Build Process

Building software is a complex task that many developers gloss over. The simple act of compiling a program invokes a process with five or more stages, including pre-processing, compiling, optimizing, assembling, and linking. However, a complex software system will have hundreds or even thousands of source files, as well as dozens or hundreds of build configuration options, auto configuration scripts (cmake, autotools), build scripts (such as Makefiles) to coordinate the process, test suites, and more.
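
You can observe these stages individually with GCC by stopping the process at each step (standard GCC options; the file names are examples):

gcc -E program.c -o program.i    # pre-process only
gcc -S program.i -o program.s    # compile to assembly (optimization happens here)
gcc -c program.s -o program.o    # assemble into an object file
gcc program.o -o program         # link into an executable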

The build process varies significantly between software packages. Most software distribution projects (including Linux distributions such as Ubuntu and Fedora) use a packaging system that further wraps the build process in a standardized script format, so that different software packages can be built using a consistent process.

In order to get consistent and comparable benchmark results, you need to ensure that the software is being built in a consistent way. Altering the build process is one way of optimizing software.

Note that the build time for a complex package can range up to hours or even days!

Course Setup

Follow the instructions on the SPO600 Communication Tools page to set up a blog, create SSH keys, and send your blog URLs and public key to me.

I will use this information to:

  1. Update the Current SPO600 Participants page with your information, and
  2. Create an account for you on the SPO600 Servers.

This updating is done in batches once or twice a week – allow some time!

Week 1 - Class II

Video

6502 Assembly

Lab 1

Week 1 Deliverables

  1. Perform Lab 1 and blog your results

Week 2

Week 2 - Class I

Video

Compilers: Standard Optimizations and Feature Flags

Week 2 - Class II

Video

6502 Math and Jumps, Branches, and Procedures

Week 2 Deliverables

Week 3

Week 3 - Class I

Video

Compilers: Targets and Tuning

Week 3 - Class II

There is no synchronous (Zoom) class for January 26.

Video

Lab

Now it's your turn to experiment with 6502 assembly language and have some fun. The 6502 Math and Strings Lab (Lab 2) gives you a lot of flexibility to choose an interesting mini-project and execute it.

Week 3 Deliverables

Week 4

Week 4 - Class I

Video

Resources

Experimentation

Week 4 - Class II

Video

Resources

Week 4 Deliverables

Week 5

Week 5 - Class I

Video

Lab 3

Week 5 - Class II

Video

Week 5 Deliverables

Week 6

Week 6 - Class I

Video

Code Examples

Week 6 Deliverables

Week 7

Video

Building GCC

These are the steps required to build GCC (a condensed command sequence follows the list):

  1. Obtain the source by anonymously pulling from the main branch of the git repository: git clone git://gcc.gnu.org/git/gcc.git
  2. Create an empty build directory in which to build the software. This should not be inside the source tree; a good place to put it is beside the source tree.
  3. Change your working directory to the build directory.
  4. Perform this step ONLY as your regular, non-root user. Run the configure script in the source directory using a relative path (e.g., ../gcc/configure). Add a --prefix=dir option to specify where the software will be installed, plus any other configuration options you want to specify. The dir should be within your home directory, for example $HOME/gcc-test-001/
  5. Run make with the -j n option to specify the maximum number of parallel jobs that should be executed at one time. The value of n should typically be in the range of (number of cores + 1) to (2 * number of cores + 1) depending on the performance characteristics of the machine on which you're building.
  6. Run make install as a non-root user. Assuming you specified the prefix correctly above, the software should install into subdirectories of the prefix directory, e.g., prefix/bin, prefix/lib64, and so forth.
  7. Add your bin directory to your PATH: PATH="prefix/bin:$PATH"
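
Condensed into a single command sequence, the steps above look roughly like this (the directory names and -j value are examples; adjust them for your machine):

git clone git://gcc.gnu.org/git/gcc.git
mkdir gcc-build-001                 # beside the source tree, not inside it
cd gcc-build-001
../gcc/configure --prefix=$HOME/gcc-test-001
make -j 24                          # e.g., for a 16-core machine
make install
PATH="$HOME/gcc-test-001/bin:$PATH"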

There is no need to run any of these steps as the root user, and it is dangerous to run the installation step as the root user, because you could overwrite the system's copy of the software you're installing. Use your regular user account instead.

To build another copy of the same gcc version, perhaps with some code or configuration changes, you can either repeat the process above with a fresh build directory (start at step 2), or you can run make clean in your existing build directory and then repeat the process above (start at step 4). Which option you choose will depend on whether you want to keep the previous build for reference.

Tip: Each build takes a lot of disk space (12 GB or more in the build directory and 2.7 GB or more in the installation directory), so check your available disk space periodically (df -h .). Delete unneeded builds regularly. If you're using the class servers and space is getting low, let your professor know and he can adjust the system's storage configuration.

Testing Your Build

To test your build:
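
A quick sanity check (a minimal sketch, assuming the prefix layout described in the Week 7 notes) is to confirm that the newly built compiler is the one on your PATH and that it can build and run a trivial program:

which gcc                        # should report prefix/bin/gcc
gcc --version                    # should report the version you built
echo 'int main(void){return 0;}' > test.c
gcc test.c -o test && ./test && echo OK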

Week 8

Week 8 - Class I

Video

Project

Week 8 - Class II

Video

Week 8 Deliverables

Week 9

Week 9 - Class I

Video

Using AArch64 Software on an x86_64 System

The qemu-aarch64 instruction emulator will enable the execution of aarch64 code on any Linux system.

If the system is an aarch64 system, then the majority of the code will run natively on the CPU, and qemu-aarch64 will only handle instructions that are not understood by that CPU. For example, if the CPU is an ARMv8 CPU and the software is ARMv9 software, then the majority of the instructions will run directly on the CPU, and the few ARMv9 instructions that are not present in ARMv8 (such as SVE2 instructions) will be handled much more slowly by the qemu-aarch64 software. You can use this approach to (for example) run ARMv9 software on the class aarch64 server.

To use qemu-aarch64 on an aarch64 system, place the qemu-aarch64 executable in front of the name of the executable you wish to run:

qemu-aarch64 testprogram ...

However, if the system is an x86_64 system, then the CPU will not be able to execute any of the aarch64 instructions, and all of the instructions will be emulated by the qemu-aarch64 software. That means that the code will execute, but at a fraction of the speed at which it would execute on an actual aarch64 system. However, it will run!

To use qemu-aarch64 on an x86_64 system, you will need the qemu-aarch64 software as well as a full set of userspace files (binaries, libraries, and so forth). You can obtain these from the /public directory on the class x86_64 server:

$ ll -h /public/aarch64-f38*
-rw-r--r--. 1 chris chris 2.5K Oct 13 10:44 /public/aarch64-f38-root.README
-rw-r--r--. 1 chris chris 934M Oct 13 08:33 /public/aarch64-f38-root.tar.xz

The README file contains installation instructions. The tar.xz file contains the userspace, qemu-aarch64 static binary, and a startup script. Note that the tar.xz file is almost 1 GB in size, and will expand to approximately 3.5 GB when uncompressed.

When the tar.xz file is installed on a Linux system using the instructions in the README file, you will have a full aarch64 Fedora 38 Linux system available. The start-aarch64-chroot script in the top directory of the unpacked archive will start the qemu environment using a chroot command. Note that this is not a virtual machine – it's a specific group of processes running under the main system.

The /proc and /sys filesystems are not mounted by default in the aarch64 chroot. The best way to mount these is to add these lines to the /etc/fstab file within the chroot:

proc	/proc  proc    defaults 0 0
sysfs /sys   sysfs   defaults 0 0

You may want to comment out the lines for /boot and /boot/efi at the same time.

Once those changes have been made to the /etc/fstab file, you can mount the additional filesystems with the command:

mount -a

It may also be useful to set wide-open permissions on the /dev/null device:

chmod a+rw /dev/null

Note that the chroot environment starts a root shell. You can create other users with the useradd command, and switch from root to those users with the command su - username

To build GCC in the aarch64 chroot, you will need to install these dependencies (use dnf):

gmp-devel
mpfr-devel
libmpc-devel
gcc-g++

Using a Raspberry Pi 4 or 5

The Raspberry Pi 4 and 5 utilize aarch64 processors, but are not very fast systems. The Pi 5 is noticeably faster than the Pi 4 and is available with more RAM (8 GB).

You can use a Pi 4 or a Pi 5 to build software. When building code using make, a jobs value of -j5 is probably optimal (both models have four cores, matching the cores-plus-one guideline above).

Using Raspberry Pi OS, you will need to install (at least) these dependencies to build GCC (use apt install to install them):

gcc
make
libmpc-dev
libgmp-dev
libmpfr-dev

Then run configure and make with the usual arguments. Note that SD cards may be slow for storage - consider using an external USB3 SSD or the fastest SD card you can find.

Build time for GCC 14 is approximately 168 minutes on a Pi 5 with 8 GB of RAM.

Using Make Check on GCC

The GCC test suite, distributed with the source code, is based on the DejaGNU framework.

As documented in the notes for the compiler testsuite, you must use the -k option with make check:

make -k check

However, in order for this to succeed, the DejaGNU software must be installed on your target system. On Fedora, you can do this with sudo dnf install dejagnu. On Debian/Ubuntu/Raspberry Pi OS systems, use sudo apt install dejagnu.

Note that the test suite will take hours to execute, even on a fast system!

It produces a number of files ending in .sum which summarize the test results (it will also produce other log files - see the documentation). It's a good idea to merge the stdout and stderr of the make command and redirect that to a log file, too, perhaps like this:

$ time make -k check |& tee make-check.log
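
To compare the results of two test runs, the GCC source tree includes a helper script, contrib/compare_tests, which reports the regressions between two sets of results (the .sum paths shown are examples; they appear under each build tree):

../gcc/contrib/compare_tests previous-build/gcc/testsuite/gcc/gcc.sum current-build/gcc/testsuite/gcc/gcc.sum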

Week 9 - Class II

Video

Week 9 Deliverables

Week 10

Week 10 - Class I

Video

Task Selection

We selected and assigned tasks during this class. The task assignments are visible on the Participant Page as well as the Project Page.

Week 10 - Class II

Video

Week 10 Deliverables

Week 11

Week 11 - Class I

Week 11 Deliverables

Week 12

Week 12 - Class I

Project Review

We reviewed the goals and approaches of the project.

The Problem

There are multiple versions of processors of every architecture currently on the market. You can see this when you go into a computer store such as Canada Computers or Best Buy – there are laptops and desktops with processors ranging from Atoms and Celerons to Ryzen 3/5/7/9 and Core i3/i5/i7/i9 processors, and workstations and servers with processors ranging up to Xeon and Epyc/Threadripper devices. Similarly, cellphones range from devices with Cortex-A35 cores through Cortex-X3 cores.

This wide range of devices supports a diverse range of processor features.

Software developers (and vendors) are caught between supporting only the latest hardware, which limits the market they can sell to, and forgoing recent processor improvements, which harms the performance of their software. Neither option is attractive for a software company wishing to be competitive.

The Goal

To take good advantage of processor features with minimal effort by the software developers.

Three Solutions

There are three solutions in various stages of preparation, each of which builds upon the previous solutions:

  1. IFUNC - Indirect Functions - This is a solution provided by the development toolchain (compiler, linker, libraries) but which is largely manual for the software developer. The developer provides multiple alternate versions of performance-critical functions, targeted at different micro-architectural levels, plus a resolver function that selects between the implementations at runtime. Note that IFUNC is the only solution which enables a resolver function that takes into account factors other than the micro-architectural level of the processor. For example, a resolver function could select between alternate functions based on available memory, storage performance, or the speed of the network connection.
  2. FMV - Function Multi-Versioning - This is a solution that is also supported by the development toolchain but which involves slightly less manual work for the developer. There are two levels of FMV:
    1. FMV with Manual Alternate Functions - The programmer provides the alternate functions and uses function attributes to specify the microarchitectural level at which each is targeted. The resolver function for each group of alternate functions is automatically generated.
    2. FMV with Cloned Functions - The programmer provides one version of the function and uses function attributes to specify that clones of that function are to be built, and the micro-architectural targets for each clone. The resolver function for each group of cloned functions is automatically generated. The only difference between the cloned functions is the micro-architectural optimizations that are applied by the compiler. Note that there is nothing to ensure that the clones are actually any better or in fact different from each other.
  3. AFMV - Automatic Function Multi-Versioning - This is what we're working on. It is effectively FMV with Cloned Functions, but the cloning is controlled from the command line rather than through function attributes. This has the advantage that no source changes are required. Every function in the program is cloned, and after the various optimization passes have been applied, the cloned functions are analyzed. If the functions are different, they are kept; if they are identical, they are removed, and only the default version of the function is used.

Specifics: IFUNC

GCC IFUNC documentation:
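
As a minimal sketch of the IFUNC mechanism (illustrative names; see the GCC documentation for the full details), the resolver runs once at program load and returns a pointer to the implementation to use:

#include <stdio.h>

/* Two alternate implementations of the same operation. */
static void process_generic(void) { puts("generic version"); }
static void process_fast(void)    { puts("optimized version"); }

/* The resolver runs once, at load time. It could test CPU features,
   available memory, or anything else; this placeholder always picks
   the generic version. */
static void (*resolve_process(void))(void)
{
    int have_fast_cpu = 0;    /* illustrative placeholder test */
    return have_fast_cpu ? process_fast : process_generic;
}

/* The public function is bound to whichever implementation the
   resolver returns (requires an ELF platform such as Linux). */
void process(void) __attribute__((ifunc("resolve_process")));

int main(void)
{
    process();
    return 0;
}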

Specifics: FMV

Current documentation:

1. GCC documentation

2. ARM ACLE documentation

Implementation in GCC
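
As a sketch of FMV with Cloned Functions using GCC's target_clones attribute (the x86_64 feature names here are examples; the ARM ACLE document above lists the aarch64 equivalents):

#include <stdio.h>

/* GCC emits one clone of this function per listed target, plus an
   automatically generated resolver that selects a clone at load time. */
__attribute__((target_clones("default", "avx2")))
int sum(const int *a, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += a[i];
    return total;
}

int main(void)
{
    int data[] = {1, 2, 3, 4};
    printf("%d\n", sum(data, 4));
    return 0;
}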

Week 12 Deliverables