Does wrapping/saturating generate branches?

✍️ Written on 2021-12-20 in 1214 words.
Part of cs software-development rustlang

Motivation

It is very difficult to predict when multiplication will overflow. As such in memory-unsafe languages, people tend to use the multiplication algorithm without appropriate checks. As a result, rust decided to provide dedicated multiplication methods (e.g. of the primitive data type i32):

  • fn wrapping_mul(self, rhs: i32) → i32 computes the product of both arguments modulo 232

  • fn saturating_mul(self, rhs: i32) → i32 computes the product of both arguments or i32::MAX/i32::MIN in case of an overflow/underflow respectively

There are also additional functions which give you more information about overflows:

  • fn overflowing_mul(self, rhs: i32) → (i32, bool) returns the result of wrapping_mul and a boolean value indicating whether an overflow occured

  • fn checked_mul(self, rhs: i32) → Option<i32> returns the product iff no overflow/underflow occurs

  • fn unchecked_mul(self, rhs: i32) → i32 computes the product and triggers undefined behavior if an overflow/underflow occurs

Do these functions generate branches? Branches might be of interest for cryptographic applications (constant-time algorithms) and performance (branch freedom). Intuitively, they are going to. They will compute the product of two 32 bits integers as a 64 integer and check whether the high 32 bits of the result are non-zero. But let us evaluate it.

The methodology

We use godbolt and look at the following example, where op is replaced by one of the two functions above:

pub fn square(num: i32) -> i32 {
    num.op(num)
}

Then we look at the assembly and identify jump instructions and/or labels to determine whether a branch was generated. Specifically, we are interested in the following four architectures:

  • x86_64-unknown-linux-gnu

  • armv7a-none-eabi

  • riscv32imac-unknown-none-elf

  • wasm32-unknown-unknown

I used a command like rustc test.rs --emit asm --crate-type lib --target=x86_64-unknown-linux-gnu. WebAssembly requires the #![no_std] annotation in the source file. Additionally, we look at corresponding addition operations.

Results

Creates a branch? wrapping_mul saturating_mul wrapping_add saturating_add

x86_64-unknown-linux-gnu

no; imull

yes; imull and jne

no; addl

no; addl

armv7a-none-eabi

no; mul

yes; mul and beq

no; add

no; qadd

riscv32imac-unknown-none-elf

no; mul

yes; mul, mulh, and bne

no; add

yes; add and bne

wasm32-unknown-unknown

no; i32.mul

yes; i32.add, br_if, and br_table

no; i32.add

no; i32.add

In the case of branched saturating methods, the checked method can be found explicitly in the executable.

A deeper look at saturating_add at x86_64-unknown-linux-gnu

    .text
    .file	"test.6a688027-cgu.0"
    .section	.text._ZN4test6square17h5fb345a26144c432E,"ax",@progbits
    .globl	_ZN4test6square17h5fb345a26144c432E
    .p2align	4, 0x90
    .type	_ZN4test6square17h5fb345a26144c432E,@function _ZN4test6square17h5fb345a26144c432E:
    .cfi_startproc
    pushq	%rax
    .cfi_def_cfa_offset 16
    movl	%edi, %eax
    movl	%eax, %edi
    addl	%edi, %edi
    testl	%edi, %edi
    setns	%cl
    movzbl	%cl, %ecx
    addl	$2147483647, %ecx
    addl	%eax, %eax
    cmovol	%ecx, %eax
    movl	%eax, 4(%rsp)
    movl	4(%rsp), %eax
    movl	%eax, (%rsp)
    movl	(%rsp), %eax
    popq	%rcx
    .cfi_def_cfa_offset 8
    retq
.Lfunc_end0:
    .size	_ZN4test6square17h5fb345a26144c432E, .Lfunc_end0-_ZN4test6square17h5fb345a26144c432E
    .cfi_endproc

    .section	".note.GNU-stack","",@progbit

Apparently, EDX is added with EDX and then an AND is computed between the arguments (testl). setns %cl will set the lowest byte of register ECX if SF=0 (thus the sign flag was set after AND). Then a zero-extended byte is converted into a long and a MOV is applied (movzbl). Then this value is added to 231 - 1. After another add of EAX, cmovol is executed, which applies a conditional MOV if an overflow occured.

At this point, we can already identify the idea how to circumvent a branched instruction: we look at the highest bit of the argument (remember that function square above has only one argument). If it is set, we will get an overflow and will overwrite the result with the high value. Otherwise, we will result the non-overflowing sum.

Conclusion

I expected the intuitive result to be true for every platform. But apparently, x86_64, ARM and WASM can handle saturating_add without branches. Very surprising! One can argue that the branch/condition is hidden inside the instructions, but this was not the original question. I care about branches on instruction level.