Lukas' weblog

✍️ Written on 2021-11-10 in 2471 words.

Motivation

In the past, one wrote inline assembly in Rust using the asm!(…) macro, whose syntax was forwarded one-to-one to LLVM (btw, LLVM's inline assembly syntax is based on GCC’s). Recently, Rust moved to a new macro which is meant to be more generic and idiomatic. The old macro was renamed to llvm_asm!(…) and the new one was introduced as asm!(…). The details are discussed in RFC 2873.

At the time of writing, inline assembly only works on the nightly release of the compiler. Of course, the instructions are platform-specific and thus the examples shown only work for x86_64 CPUs. In this blog post, I want to use the RDTSC instruction via Rust inline assembly.

CPUID, manufacturer ID and RDTSC support

CPUID is an instruction to get vendor, feature, instruction support, and metadata information about one’s processor.

We will use it to first print the manufacturer ID and then check for support of RDTSCP (the RDTSC variant used below). The instructions themselves are written in Intel syntax and thus have the layout INSTR dst, src.

#![feature(asm)]

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn has_rdtsc_support() -> bool {
  // Step 1: ask for generic information and print it to stdout
  {
    let ebx: u32;
    let ecx: u32;
    let edx: u32;
    let mut manu_id = [0u8; 12];

    unsafe {
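      asm!(
        // rbx is reserved by LLVM on x86_64: park it in a scratch register,
        // then swap it back while moving the cpuid result out
        "mov {bx:r}, rbx",
        "cpuid",
        "xchg {bx:r}, rbx",
        // “output operands” following
        // (out, not lateout: the scratch register is written before cpuid reads eax)
        bx = out(reg) ebx,
        lateout("ecx") ecx,
        lateout("edx") edx,
        // “input operands” following
        in("eax") 0,
        // “clobbers” list
        lateout("eax") _,
        // “options” ∈ {"pure", "nomem", "nostack"}
        options(nomem, nostack)
      );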
    }

    // assemble manufacturer ID ASCII string
    for i in 0..4 {
      // NOTE order [0, 8, 4] is intentional - see CPUID docu
      manu_id[i + 0] = (ebx >> 8 * i) as u8;
      manu_id[i + 8] = (ecx >> 8 * i) as u8;
      manu_id[i + 4] = (edx >> 8 * i) as u8;
    }
    let manufacturer_id = (0..12).map(|i| char::from(manu_id[i])).collect::<String>();

    dbg!(manufacturer_id);
  }

  // Step 2: ask for rdtsc support
  let edx: u32;
  {
    unsafe {
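      asm!(
        // cpuid overwrites eax, ebx, ecx, and edx;
        // rbx is reserved by LLVM, so save and restore it by hand
        "mov {tmp:r}, rbx",
        "cpuid",
        "mov rbx, {tmp:r}",
        tmp = out(reg) _,
        inout("eax") 0x80000001u32 => _,
        lateout("ecx") _,
        lateout("edx") edx,
        options(nomem, nostack),
      );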
    }
  }

  (edx >> 27) & 1 > 0
}

First, we set register eax to 0 (in("eax") 0). Then we invoke CPUID, which writes the manufacturer string to the registers ebx, edx, and ecx. Because rbx is reserved by LLVM on x86_64 and cannot be named as an operand, the value of ebx is copied into a compiler-chosen register ({bx}). We use lateout for ecx and edx because these registers are only written once all inputs have been consumed. We also specify that our assembly does not access main memory or the stack (options(nomem, nostack)), but we do not declare it pure, so the compiler keeps every invocation. To the best of my knowledge, output registers are treated as clobbered automatically; a register that is overwritten but whose value we do not need is declared as lateout("eax") _.

To finish, I just apply very primitive ASCII parsing and use the dbg! macro to show the assembled manufacturer ID.
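
The byte shuffling can also be written without the index arithmetic. The following is a minimal sketch, not the code used in this post: vendor_string is a hypothetical helper that takes the three register values as plain integers, so it can be tried without any inline assembly.

```rust
/// Assemble the 12-byte CPUID vendor string from the three result registers.
/// The register order EBX, EDX, ECX is the one CPUID leaf 0 uses.
/// (Hypothetical helper for illustration.)
fn vendor_string(ebx: u32, edx: u32, ecx: u32) -> String {
    let mut bytes = [0u8; 12];
    // each register holds four ASCII characters, lowest byte first
    bytes[0..4].copy_from_slice(&ebx.to_le_bytes());
    bytes[4..8].copy_from_slice(&edx.to_le_bytes());
    bytes[8..12].copy_from_slice(&ecx.to_le_bytes());
    // vendor strings are plain ASCII; anything else is replaced lossily
    String::from_utf8_lossy(&bytes).into_owned()
}
```

Calling vendor_string(0x756e6547, 0x49656e69, 0x6c65746e) yields "GenuineIntel", because each register stores its four characters in little-endian order.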

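Finally, I invoke CPUID again with argument 0x80000001, which returns the RDTSCP support indicator bit (namely bit 27) in register edx. Strictly speaking, this bit covers RDTSCP, which is the instruction used below; support for plain RDTSC is reported by CPUID leaf 1 in bit 4 of edx.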
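
If hand-written assembly is not the goal, the same check can be sketched with the __cpuid intrinsic from std::arch::x86_64. This is an alternative to the listing above, not what this post uses. The names bit_is_set and has_rdtscp_support are made up for this example, and the sketch assumes a toolchain that ships the stable std::arch intrinsics.

```rust
/// True if bit `bit` of `reg` is set. (Hypothetical helper.)
fn bit_is_set(reg: u32, bit: u32) -> bool {
    (reg >> bit) & 1 == 1
}

/// Query CPUID leaf 0x80000001 through the standard library intrinsic
/// instead of hand-written assembly.
#[cfg(target_arch = "x86_64")]
#[allow(unused_unsafe)] // in case a toolchain declares the intrinsic as safe
fn has_rdtscp_support() -> bool {
    use std::arch::x86_64::__cpuid;
    // leaf 0x80000000 reports the highest supported extended leaf in eax
    let max_extended = unsafe { __cpuid(0x8000_0000) }.eax;
    if max_extended < 0x8000_0001 {
        return false;
    }
    // bit 27 of edx signals RDTSCP support
    let edx = unsafe { __cpuid(0x8000_0001) }.edx;
    bit_is_set(edx, 27)
}

/// Fallback for other targets: the instruction does not exist there.
#[cfg(not(target_arch = "x86_64"))]
fn has_rdtscp_support() -> bool {
    false
}
```

The intrinsic takes care of saving and restoring rbx, which is the part that is easy to get wrong by hand.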

RDTSC

Available since the Intel Pentium, the RDTSC instruction returns the processor's current time stamp counter. It is the basic tool for measuring a routine in clock cycles. On native platforms, this instruction was used in exploits of the Spectre/Meltdown vulnerabilities. As a result, browsers such as Mozilla Firefox only expose timing data rounded to a reduced precision.

When I first heard about RDTSC, I wondered when those 64 bits would overflow, since the frequency is quite high. My CPU reports about 1624.382 MHz (be aware of dynamic frequency scaling). Assuming 1,703,287,980 cycles per second, the 2⁶⁴ states of a u64 last for 10,830,079,405 seconds. This amounts to about 343 years until an overflow occurs.
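
The estimate is easy to reproduce. Here is a small sketch of the arithmetic; seconds_until_wrap and whole_years are hypothetical helper names, and a year is assumed to have 365.25 days.

```rust
/// Seconds until a 64-bit counter wraps when it advances `hz` times per second.
/// Panics if `hz` is zero. (Hypothetical helper.)
fn seconds_until_wrap(hz: u64) -> u128 {
    // 2^64 does not fit into a u64, so the division is done in u128
    (1u128 << 64) / u128::from(hz)
}

/// Convert seconds to whole years of 365.25 days each.
fn whole_years(seconds: u128) -> u128 {
    seconds / 31_557_600
}
```

With hz = 1_703_287_980, seconds_until_wrap returns 10_830_079_405, and whole_years of that value is 343.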

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn rdtscp() -> (u64, u32) {
  let eax: u32;
  let ecx: u32;
  let edx: u32;
  {
    unsafe {
      asm!(
        "rdtscp",
        lateout("eax") eax,
        lateout("ecx") ecx,
        lateout("edx") edx,
        options(nomem, nostack)
      );
    }
  }


  let counter: u64 = (edx as u64) << 32 | eax as u64;
  (counter, ecx)
}

We simply call the RDTSCP instruction, fetch the result from the eax, ecx, and edx registers, and assemble the 64-bit value. In detail, the EDX register is loaded with the high-order 32 bits of the IA32_TSC MSR. The EAX register is loaded with the low-order 32 bits of the IA32_TSC MSR. And the ECX register is loaded with the low-order 32 bits of the IA32_TSC_AUX MSR. We mainly care about the IA32_TSC value.
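
The register layout can be illustrated in isolation. Below is a sketch with two made-up names: combine_tsc joins the two halves, and rdtscp_intrinsic shows how the __rdtscp intrinsic from std::arch::x86_64 could replace the inline assembly. It assumes the CPU supports RDTSCP, which should be checked via CPUID first.

```rust
/// Combine the two 32-bit halves that RDTSC/RDTSCP leave in EDX:EAX.
/// (Hypothetical helper.)
fn combine_tsc(edx: u32, eax: u32) -> u64 {
    (u64::from(edx) << 32) | u64::from(eax)
}

/// Alternative without inline assembly: the standard library intrinsic
/// returns the counter and stores IA32_TSC_AUX through the pointer.
#[cfg(target_arch = "x86_64")]
#[allow(dead_code, unused_unsafe)]
fn rdtscp_intrinsic() -> (u64, u32) {
    let mut aux = 0u32;
    // assumption: the CPU supports RDTSCP
    let counter = unsafe { std::arch::x86_64::__rdtscp(&mut aux) };
    (counter, aux)
}
```

For example, combine_tsc(0x1, 0x2) gives 0x1_0000_0002: edx ends up in the upper half, eax in the lower half.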

Main routine

In the main routine, we first call has_rdtsc_support. If support is provided, we build a benchmark suite with rdtscp. Be aware that proper setup of benchmarked routines is important.

Either use the volatile crate or make sure the results are used. Otherwise the compiler might discard the instructions. Because I didn’t want to introduce a dependency, I use a somewhat awkward variable_to_avoid_optimizations.
You might want to flush cache lines with clflush to prevent interference from previous instructions.
You need to declare input/output/clobber registers in the inline assembly properly. Otherwise the results might be wrong or the hardware-software contract might be violated. As a result, the measurement will be invalid.
You might want to pin the process to one core by setting its CPU affinity.
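
For the first point, a dependency-free option is std::hint::black_box from the standard library (stable in recent Rust releases). The sketch below is not what this post uses: workload is a hypothetical function that mirrors the measured loop, and black_box is only a best-effort hint to the optimizer.

```rust
use std::hint::black_box;

/// The measured operation, shielded from the optimizer with `black_box`.
/// (Hypothetical function for illustration.)
fn workload(i: usize) -> u64 {
    let mut a = 0u64;
    for _ in 0..100 {
        // black_box hides the constant, so the loop is not folded away
        a += black_box(2);
        if i % 3 == 1 {
            // to vary measurement time a little
            a += 1;
        }
    }
    // passing the result through black_box marks it as used
    black_box(a)
}
```

Calling workload between the two timestamp reads would then make the extra accumulator variable unnecessary.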

const REPETITIONS: usize = 200;

fn main() -> Result<(), Box<dyn std::error::Error>> {
  if !has_rdtsc_support() {
    panic!("rdtsc instruction not available on your CPU")
  }

  let mut measurements = vec![];
  let mut variable_to_avoid_optimizations = 0;
  for i in 0..REPETITIONS {
    let (start, _) = rdtscp();

    // op to measure
    let mut a = 0;
    for _ in 0..100 {
      a += 2;
      if i % 3 == 1 {
        // to vary measurement time a little
        a += 1;
      }
    }

    let (end, _) = rdtscp();

    variable_to_avoid_optimizations = (variable_to_avoid_optimizations % 1000) + a;
    measurements.push(end - start);
  }
  if variable_to_avoid_optimizations == 99_999 {
    println!("foobar");
  }

  dbg!(&measurements);
  measurements.sort();
  let mean = measurements.iter().sum::<u64>() as f64 / measurements.len() as f64;

  println!("min = {}", measurements[0]);
  println!("mean = {}", mean);
  println!("median = {}", measurements[measurements.len() / 2]);
  println!("max = {}", measurements[measurements.len() - 1]);

  Ok(())
}

Example data

[rdtsc.rs:47] manufacturer_id = "GenuineIntel"
[rdtsc.rs:129] &measurements = [
    5530,
    4828,
    4694,
    4688,
    4878,
    4694,
    …
    4706,
    511258,
    4724,
    …
    4672,
    20628,
    6672,
    …
    4658,
    4800,
    4670,
    4668,
    4808,
    …
    4686,
    4686,
    4856,
]
min = 4654
mean = 7358.24
median = 4690
max = 511258

I regularly get one (or two) outliers (here: {511258, 20628}). Thus, you should look at the median and the 95th percentile, not the mean and the maximum.
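
A percentile can be read off the sorted measurements directly. The following is a sketch using the nearest-rank definition, which is an assumption on my side (other definitions interpolate between neighbours); percentile is a hypothetical helper.

```rust
/// Nearest-rank percentile of an ascending-sorted, non-empty slice.
/// `p` is given in percent (0 < p <= 100). (Hypothetical helper.)
fn percentile(sorted: &[u64], p: usize) -> u64 {
    // rank = ceil(p * n / 100), counted from 1
    let rank = (p * sorted.len() + 99) / 100;
    sorted[rank.max(1) - 1]
}
```

On 20 synthetic values, 1 to 19 plus a single outlier of 1000, percentile(&data, 95) returns 19 and percentile(&data, 50) returns 10, whereas the maximum is dominated by the outlier.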

Conclusion

Subjectively, I prefer the new asm!(…) macro syntax over the llvm_asm!(…) syntax. Objectively, Rust provides a mature inline assembly interface.

RDTSC allows you to evaluate benchmarks in clock cycles. This is preferred over time because we want to be independent of the frequency. In practice, you want to use a proper benchmarking library like the criterion crate, which takes care of warmup and accurate measurements. Its F.A.Q. addresses associated issues. And in cryptography, one benchmarking library is SUPERCOP, which can give some inspiration.