Rust performance results converting slice into array

Written on 2021-11-10 in 2088 words ✍️.
Part of cs IT-security pqcrypto rustlang

Motivation

TL;DR This tries to evaluate the best approach for a problem which cannot be solved perfectly in rust.

In post-quantum schemes, you often have the structure that you get supplied a large bytearray with data (some rust array) via a communication channel. You are expected to decompose it into individual parts. I want to illustrate this with the following example:

fn decrypt(ct: &[u8; 8]) {
  let mut value_f = [0u8; 4];
  value_f.copy_from_slice(&ct[0..4]);

  let mut value_finv = [0u8; 4];
  value_finv.copy_from_slice(&ct[4..8]);

  // … use value_f and value_finv in subroutines
  // to compute cryptographic operations …

  assert_eq!(value_f[1], 6);
  assert_eq!(value_finv[1], 7);
}

fn main() {
  let ciphertext = [1u8, 6, 8, 7, 9, 7, 4, 2];
  decrypt(&ciphertext);
}

Assume f and finv are two cryptographic variables, we want to process as individual values.

Using the type &[u8; 8] and method copy_from_slice is only particular style. But there are alternatives. Thus my motivation to take some notes on arrays and slices. Now the question is: What are the alternatives? Which is the most efficient approach?

I decided to list them and evaluate them in microbenchmarks. I use the architecture based on RDTSC. I ran all benchmarks 1000 times and use a (not 8, but) 1024-element array to be split into two halves. The benchmarks ran on my Intel x86_64 CPU.

One additional idea pointed out by David was to add assert_eq!(ct.len(), 1024) in the case of a slice. This allows the compiler make one initial runtime check for the size. Subsequently, all size checks could be skipped. Does this information actually propagate?

Approaches

copy_from_slice

let mut value_f = [0u8; 512];
value_f.copy_from_slice(&ct[0..512]);

let mut value_finv = [0u8; 512];
value_finv.copy_from_slice(&ct[512..1024]);

The documentation already points out a copy operation (“Copies all elements from src into self, using a memcpy”). Thus, it does not look promising. Looking at the source code, we see that core::ptr::copy_nonoverlapping is used which is semantically equivalent to C’s memcpy.

type median clock cycle (less is better)

ct: &[u8; 1024]

37602

ct: &[u8]

37198

ct: &[u8] with assert

36768

copy with for-loop

let mut value_f = [0u8; 512];
for i in 0..512 { value_f[i] = ct[i]; }

let mut value_finv = [0u8; 512];
for i in 512..1024 { value_finv[i - 512] = ct[i]; }

Since copy_from_slice actually copies, we could just copy data manually by iterating over elements and assigning them.

type median clock cycle (less is better)

ct: &[u8; 1024]

73210

ct: &[u8]

73738

ct: &[u8] with assert

73786

transmute

let value_f = unsafe { std::mem::transmute::<*const [u8], &[u8; 512]>(&ct[0..512] as *const [u8]) };
let value_finv = unsafe { std::mem::transmute::<*const [u8], &[u8; 512]>(&ct[512..1024] as *const [u8]) };

Now let us leave safe rust for a moment. Another copying method is transmute. It turns out that transmute checks the length of its arguments. And it does not even allow to extract a subset (smaller array of a larger one). As I result, I couldn’t make it compile.

error[E0512]: cannot transmute between types of different sizes, or dependently-sized types
   --> rdtsc.rs:101:31
    |
101 |     let value_finv = unsafe { std::mem::transmute::<*const [u8], &[u8; 512]>(&ct[512..1024] as *...
    |                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: source type: `*const [u8]` (128 bits)
    = note: target type: `&[u8; 512]` (64 bits)

try_from

let value_f = <[u8; 512]>::try_from(&ct[0..512])?;
let value_finv = <[u8; 512]>::try_from(&ct[512..1024])?;

The TryFrom trait seems to be the best approach for this. rust implemented a custom routine to go from type &[u8] to [u8; N]. Obviously, it should just compile to a single runtime check evaluating length equivalence.

type median clock cycle (less is better)

ct: &[u8; 1024]

37456

ct: &[u8]

36152

ct: &[u8] with assert

37234

arrayref

let value_f = array_ref![ct, 0, 512];
let value_finv = array_ref![ct, 512, 512];

Lukas pointed me to the arrayref crate. It explicitly targets our usecase and solves it with macros.

type median clock cycle (less is better)

ct: &[u8; 1024]

40716

ct: &[u8]

41242

ct: &[u8] with assert

39692

Why do these types exist anyway?

An array (e.g. [u8; 8]) is a contiguous memory section containing a homogeneous collection. When you index it with a range, you get a reference to a slice (e.g. &[u8]) which is a pointer to the first element and a length. You can get its length with the len method.

Wait… so, &[u8; 8] is a pointer to a memory section together with a length. And &[u8] is a pointer to a memory section together with a length. What is the difference?

  • Fundamentally, &[u8; 8] contains the length as part of its type. Thus, the length is a compile-time variable. However, like in most other languages, this property is not used to its full extent. If you split &[u8; N] into two halves, you should get two values of types &[u8; N/2]. But the rust compiler cannot propagate such properties yet. There are many unsolved issues related to dependent types.

  • In contrast, &[u8] contains the length as runtime variable. Thus the compiler needs to place bound checks much later and can rarely optimize them away.

So, this seems to be a compile-time-versus-runtime thing. So, does it actually perform compile-time optimizations?

  • Indexing of slices is implemented by just calling indexing intrinsics.

  • Indexing of arrays is implemented by creating a reference to a slice and then calling its index operation.

So, actually both of them are implemented the same way. It seems unlikely, the compiler skips checks in the compile-time case.

Other solutions

How do other libraries solve the problem? Which type do they pick?

ring can serve as reference. It uses references to slices in general.

Conclusion

RFC 0495 (arrayref pointed to it - thx!) explicitly mentions this problem (though their usecase is pattern matching). It might lead to some improvements in the long-term.

Originally, I planned to look at the generated assembly. But since many parts call stdlib elements, it is very difficult to trace which instructions are actually executed.

Looking at the data, I conclude:

  • manual copy takes twice as long → never do this!

  • arrayref seems to introduce some additional instructions in its macros (thus there is a small overhead for some syntactic sugar)

  • The difference between copy_from_slice and try_from is negligible.