Rust performance results converting slice into array

Written on 2021-11-10 in 2088 words ✍️.
Part of cs IT-security pqcrypto rustlang

Motivation

TL;DR This tries to evaluate the best approach for a problem which cannot be solved perfectly in rust.

In post-quantum schemes, you often have the structure that you get supplied a large bytearray with data (some rust array) via a communication channel. You are expected to decompose it into individual parts. I want to illustrate this with the following example:

fn decrypt(ct: &[u8; 8]) {
  let mut value_f = [0u8; 4];
  value_f.copy_from_slice(&ct[0..4]);

  let mut value_finv = [0u8; 4];
  value_finv.copy_from_slice(&ct[4..8]);

  // … use value_f and value_finv in subroutines
  // to compute cryptographic operations …

  assert_eq!(value_f[1], 6);
  assert_eq!(value_finv[1], 7);
}

fn main() {
  let ciphertext = [1u8, 6, 8, 7, 9, 7, 4, 2];
  decrypt(&ciphertext);
}

Assume f and finv are two cryptographic variables, we want to process as individual values.

Using the type &[u8; 8] and method copy_from_slice is only particular style. But there are alternatives. Thus my motivation to take some notes on arrays and slices. Now the question is: What are the alternatives? Which is the most efficient approach?

I decided to list them and evaluate them in microbenchmarks. I use the architecture based on RDTSC. I ran all benchmarks 1000 times and use a (not 8, but) 1024-element array to be split into two halves. The benchmarks ran on my Intel x86_64 CPU.

One additional idea pointed out by David was to add assert_eq!(ct.len(), 1024) in the case of a slice. This allows the compiler make one initial runtime check for the size. Subsequently, all size checks could be skipped. Does this information actually propagate?

Approaches

`copy_from_slice`

let mut value_f = [0u8; 512];
value_f.copy_from_slice(&ct[0..512]);

let mut value_finv = [0u8; 512];
value_finv.copy_from_slice(&ct[512..1024]);

The documentation already points out a copy operation (“Copies all elements from src into self, using a memcpy”). Thus, it does not look promising. Looking at the source code, we see that core::ptr::copy_nonoverlapping is used which is semantically equivalent to C’s memcpy.

type	median clock cycle (less is better)
`ct: &[u8; 1024]`	37602
`ct: &[u8]`	37198
`ct: &[u8]` with assert	36768

copy with for-loop

let mut value_f = [0u8; 512];
for i in 0..512 { value_f[i] = ct[i]; }

let mut value_finv = [0u8; 512];
for i in 512..1024 { value_finv[i - 512] = ct[i]; }

Since copy_from_slice actually copies, we could just copy data manually by iterating over elements and assigning them.

type	median clock cycle (less is better)
`ct: &[u8; 1024]`	73210
`ct: &[u8]`	73738
`ct: &[u8]` with assert	73786

transmute

let value_f = unsafe { std::mem::transmute::<*const [u8], &[u8; 512]>(&ct[0..512] as *const [u8]) };
let value_finv = unsafe { std::mem::transmute::<*const [u8], &[u8; 512]>(&ct[512..1024] as *const [u8]) };

Now let us leave safe rust for a moment. Another copying method is transmute. It turns out that transmute checks the length of its arguments. And it does not even allow to extract a subset (smaller array of a larger one). As I result, I couldn’t make it compile.

error[E0512]: cannot transmute between types of different sizes, or dependently-sized types
   --> rdtsc.rs:101:31
    |
101 |     let value_finv = unsafe { std::mem::transmute::<*const [u8], &[u8; 512]>(&ct[512..1024] as *...
    |                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: source type: `*const [u8]` (128 bits)
    = note: target type: `&[u8; 512]` (64 bits)

`try_from`

let value_f = <[u8; 512]>::try_from(&ct[0..512])?;
let value_finv = <[u8; 512]>::try_from(&ct[512..1024])?;

The TryFrom trait seems to be the best approach for this. rust implemented a custom routine to go from type &[u8] to [u8; N]. Obviously, it should just compile to a single runtime check evaluating length equivalence.

type	median clock cycle (less is better)
`ct: &[u8; 1024]`	37456
`ct: &[u8]`	36152
`ct: &[u8]` with assert	37234

arrayref

let value_f = array_ref![ct, 0, 512];
let value_finv = array_ref![ct, 512, 512];

Lukas pointed me to the arrayref crate. It explicitly targets our usecase and solves it with macros.

type	median clock cycle (less is better)
`ct: &[u8; 1024]`	40716
`ct: &[u8]`	41242
`ct: &[u8]` with assert	39692

Why do these types exist anyway?

An array (e.g. [u8; 8]) is a contiguous memory section containing a homogeneous collection. When you index it with a range, you get a reference to a slice (e.g. &[u8]) which is a pointer to the first element and a length. You can get its length with the len method.

Wait… so, &[u8; 8] is a pointer to a memory section together with a length. And &[u8] is a pointer to a memory section together with a length. What is the difference?

Fundamentally, &[u8; 8] contains the length as part of its type. Thus, the length is a compile-time variable. However, like in most other languages, this property is not used to its full extent. If you split &[u8; N] into two halves, you should get two values of types &[u8; N/2]. But the rust compiler cannot propagate such properties yet. There are many unsolved issues related to dependent types.
In contrast, &[u8] contains the length as runtime variable. Thus the compiler needs to place bound checks much later and can rarely optimize them away.

So, this seems to be a compile-time-versus-runtime thing. So, does it actually perform compile-time optimizations?

Indexing of slices is implemented by just calling indexing intrinsics.
Indexing of arrays is implemented by creating a reference to a slice and then calling its index operation.

So, actually both of them are implemented the same way. It seems unlikely, the compiler skips checks in the compile-time case.

Conclusion

RFC 0495 (arrayref pointed to it - thx!) explicitly mentions this problem (though their usecase is pattern matching). It might lead to some improvements in the long-term.

Originally, I planned to look at the generated assembly. But since many parts call stdlib elements, it is very difficult to trace which instructions are actually executed.

Looking at the data, I conclude:

manual copy takes twice as long → never do this!
arrayref seems to introduce some additional instructions in its macros (thus there is a small overhead for some syntactic sugar)
The difference between copy_from_slice and try_from is negligible.