mod3 of NTRU's reference implementation

✍️ Written on 2021-06-16 in 2261 words. Part of cs IT-security pqcrypto software-development

Motivation

NTRU is one of the finalists of the NIST PQC standardization process. The NTRU reference implementation contains some source code, we want to analyze. Similar code occurs in all candidates, but I wanted to understand it thoroughly. This article follows my previous article discussing modulo computations.

The mod3 function

static uint16_t mod3(uint16_t a)
{
  uint16_t r;
  int16_t t, c;

  r = (a >> 8) + (a & 0xff); // r mod 255 == a mod 255
  r = (r >> 4) + (r & 0xf); // r' mod 15 == r mod 15
  r = (r >> 2) + (r & 0x3); // r' mod 3 == r mod 3
  r = (r >> 2) + (r & 0x3); // r' mod 3 == r mod 3

  t = r - 3;
  c = t >> 15;

  return (c&r) ^ (~c&t);
}

The goal

The goal of this function is to compute modulo 3 for a given 16 bit value.

The immediate approach

In the previous article, we have seen how to derive a modulo computation for any constant integer. But we did not realize it in a programming language. I propose the following C function:

uint16_t immediate_mod3(uint16_t a) {
  int i;

  for (i = 0; i < 3; i++)
    a = ( ((a >> 14)       )
        + ((a >> 12) & 0x03)
        + ((a >> 10) & 0x03)
        + ((a >>  8) & 0x03)
        + ((a >>  6) & 0x03)
        + ((a >>  4) & 0x03)
        + ((a >>  2) & 0x03)
        + ((a      ) & 0x03));

  return a;
}

So, why not just use this one instead of mod3? The answer is, that there are three problems.

Problems

  1. Adding up chunks here actually means using 7 or more additions. In mod3, we only used 4. Because, we use chunks of size 2, there are 8 summands. If we use larger chunks, there would be less additions and shifts.

  2. Why 3 iterations? After 3 iterations, all overflows have been handled and the value will always be a 3 bits value. This was determined empirically. However, how many iterations do we need in mod3? The line r = (r >> 2) + (r & 0x3); can be found 2 times in mod3. But why does 2 suffice?

  3. Consider the value 3. Then ((a) & 0x03) will evaluate to 3 and will stay the same forever across iterations. Thus this algorithm is incorrect. A modular reduction for 3 is missing. Why?
    Consider the previous article. If the remainder showed \(d_0 - d_1\), we actually get the correct result. But our approach to remove any negative digit in the block representation breaks correctness. This was necessary, because we would not be able to write the corresponding C operation. There is no possibility to represent one bit of an integer negatively.

So, we can improve problem #1, we need to analyze question #2 and fix problem #3.

#1: Reduce the number of shifts and additions

The idea is simple: We reduce by 255 first, then by 15, then by 3. This way, the chunks are larger and less operations are needed. Why these numbers? Because the numbers plus 1 is a multiple of our previous chunk size 2.

#2: Number of iterations

We annotate the source code:

r = (a >> 8) + (a & 0xff);  // {9 bits} ∈ [0, 510] = {8 bits} + {8 bits};
r = (r >> 4) + (r & 0xf);   // {6 bits} ∈ [0,  45] = {5 bits} + {4 bits};
r = (r >> 2) + (r & 0x3);   // {4 bits} ∈ [0,  13] = {4 bits} + {2 bits};
r = (r >> 2) + (r & 0x3);   // {3 bits} ∈ [0,   5] = {2 bits} + {2 bits};

It is important to recognize that all points during the computation, the values computed are the correct values in ℤ3. However, we only miss the modular reductions (i.e. we miss sum subtractions of 3 to get the representative in [0, 2]).

It is also interesting to recognize that an addition of a 4 bit value with a 2 bit value gives a 5 bit value. However, we need to consider the actual possible values. In this scenario, we see that the result in line 3 can only use up to 4 bits.

Without the last line, we would get some value between 0 and 13. With the last line, we get some value between 0 and 5. Only the latter case will be easy to handle when fixing the next problem.

#3: Fix missing reductions

Let r ∈ {0, 1, 2, 3, 4, 5}. Let r be unsigned, c and t signed 16 bits integers.

t = r - 3;
c = t >> 15;
return (c&r) ^ (~c&t);

Then t ∈ {2, 1, 0, -1, -2, -3}. Now, if we shift a 16 bit signed integer by 15 position to the right. The signed semantics might be non-intuitive. -1 >> 15 gives binary 1111111111111111 (hex 0xFF). On the other hand, a positive value shifted 15 times, yields 0. Hence, c ∈ {0, 0xFF}.

Now, we need a case distinction:

If c is 0

then we evaluate (0x00 & r) ^ (~0x00 & t) which is 0x00 ^ (0xFF & t) which is t which is one of 0, 1, and 2.

else c is non-zero

then we evaluate (0xFF & r) ^ (~0xFF & t) which is r ^ (0x00 & t) which is r one of {0, 1, 2}.

Let us summarize all possible values and their evaluations:

r

0

1

2

3

4

5

t

-3

-2

-1

0

1

2

c

0xFF

0xFF

0xFF

0x00

0x00

0x00

return

0

1

2

0

1

2

r % 3

0

1

2

0

1

2

Summary

  • The first part of mod3 computes modulo 3 as explained in the previous article giving a value in [0, 5].

  • The second part maps [0, 5] to [0, 2] with bitwise operations.

  • Why the complexity?

    • In PQC, we try to write algorithms which will be used million times per day on devices. Saving runtime is crucial (though, it should not be the only metric, see djb’s “Cryptographic competitions”).

      • I measured about 170 cycles with rdtsc with gcc 9.3 on x86_64 for immediate_mod3. I measured about 95 cycles for mod3 and about 60 cycles for % 3.

      • I measured 54, 34, and 38 cycles respectively with -O3.

    • In PQC, we try to achieve leakage resilience. One basic building block is constant time execution.

Conclusion

  • We analyzed a 12 LOC C function which is part of the reference implementation of one PQC scheme.

  • We have discovered arithmetic operations to compute modulo 3.

  • We have also discovered bitwise operations to map the value 3 to 0.

  • To find proper engineered bits of software is fun!