Motivation
NTRU is one of the finalists of the NIST PQC standardization process. The NTRU reference implementation contains some source code that I want to analyze. Similar code occurs in all candidates, but I wanted to understand this one thoroughly. This article follows my previous article discussing modulo computations.
The mod3 function
static uint16_t mod3(uint16_t a)
{
  uint16_t r;
  int16_t t, c;

  r = (a >> 8) + (a & 0xff); // r mod 255 == a mod 255
  r = (r >> 4) + (r & 0xf);  // r' mod 15 == r mod 15
  r = (r >> 2) + (r & 0x3);  // r' mod 3 == r mod 3
  r = (r >> 2) + (r & 0x3);  // r' mod 3 == r mod 3

  t = r - 3;
  c = t >> 15;

  return (c & r) ^ (~c & t);
}
The goal
The goal of this function is to compute modulo 3 for a given 16-bit value.
The immediate approach
In the previous article, we have seen how to derive a modulo computation for any constant integer, but we did not implement it in a programming language. I propose the following C function:
uint16_t immediate_mod3(uint16_t a) {
  int i;
  for (i = 0; i < 3; i++)
    a = ( ((a >> 14)       )
        + ((a >> 12) & 0x03)
        + ((a >> 10) & 0x03)
        + ((a >>  8) & 0x03)
        + ((a >>  6) & 0x03)
        + ((a >>  4) & 0x03)
        + ((a >>  2) & 0x03)
        + ((a      ) & 0x03));
  return a;
}
So, why not just use this one instead of mod3? The answer is that there are three problems.
Problems

1. Adding up the chunks here actually means using 7 or more additions. In mod3, only 4 additions are used. Because we use chunks of size 2, there are 8 summands; with larger chunks, fewer additions and shifts would be needed.
2. Why 3 iterations? After 3 iterations, all overflows have been handled and the value will always fit into 3 bits. This was determined empirically. However, how many iterations do we need in mod3? The line
r = (r >> 2) + (r & 0x3);
can be found 2 times in mod3. But why do 2 suffice?
3. Consider the value 3. Then ((a) & 0x03) will evaluate to 3 and will stay 3 forever across iterations. Thus this algorithm is incorrect: a modular reduction for 3 is missing. Why?
Consider the previous article. If the remainder were allowed to show \(d_0 - d_1\), we would actually get the correct result. But our approach of removing any negative digit from the block representation breaks correctness. This was necessary because we would not be able to write the corresponding C operation: there is no possibility to represent one bit of an integer negatively.
So, we want to improve on problem #1, analyze question #2, and fix problem #3.
#1: Reduce the number of shifts and additions
The idea is simple: we reduce by 255 first, then by 15, then by 3. This way, the chunks are larger and fewer operations are needed. Why these numbers? Each of them is of the form 2^k − 1 where k is a multiple of our previous chunk size 2; therefore 3 = 2^2 − 1 divides 15 and 255, and every folding step preserves the value mod 3.
#2: Number of iterations
We annotate the source code:
r = (a >> 8) + (a & 0xff); // {9 bits} ∈ [0, 510] = {8 bits} + {8 bits};
r = (r >> 4) + (r & 0xf); // {6 bits} ∈ [0, 45] = {5 bits} + {4 bits};
r = (r >> 2) + (r & 0x3); // {4 bits} ∈ [0, 13] = {4 bits} + {2 bits};
r = (r >> 2) + (r & 0x3); // {3 bits} ∈ [0, 5] = {2 bits} + {2 bits};
It is important to recognize that at all points during the computation, the computed values are correct in ℤ_3. We only miss the modular reductions (i.e. we miss some subtractions of 3 to get the representative in [0, 2]).
It is also interesting to recognize that, in general, an addition of a 4-bit value and a 2-bit value gives a 5-bit value. However, we need to consider the actually possible values. In this scenario, we see that the result in line 3 can only use up to 4 bits.
Without the last line, we would get some value between 0 and 13. With the last line, we get some value between 0 and 5. Only the latter case will be easy to handle when fixing the next problem.
#3: Fix missing reductions
Let r ∈ {0, 1, 2, 3, 4, 5}. Let r be an unsigned, and c and t signed, 16-bit integer.
t = r - 3;
c = t >> 15;
return (c & r) ^ (~c & t);
Then t ∈ {−3, −2, −1, 0, 1, 2}. Now, if we shift a 16-bit signed integer by 15 positions to the right, the signed semantics might be nonintuitive: -1 >> 15 gives binary 1111111111111111 (hex 0xFFFF) on common compilers (strictly speaking, C leaves right shifts of negative values implementation-defined). On the other hand, a non-negative value shifted 15 times yields 0. Hence, c ∈ {0x0000, 0xFFFF}.
Now, we need a case distinction:

If c is 0x0000, then we evaluate (0x0000 & r) ^ (~0x0000 & t), which is 0x0000 ^ (0xFFFF & t), which is t, one of {0, 1, 2}.
Else, c is 0xFFFF, and we evaluate (0xFFFF & r) ^ (~0xFFFF & t), which is r ^ (0x0000 & t), which is r, one of {0, 1, 2}.
Let us summarize all possible values and their evaluations:

r          0       1       2       3       4       5
t         −3      −2      −1       0       1       2
c     0xFFFF  0xFFFF  0xFFFF  0x0000  0x0000  0x0000
return     0       1       2       0       1       2
r % 3      0       1       2       0       1       2
Summary

The first part of mod3 computes modulo 3 as explained in the previous article, giving a value in [0, 5].
The second part maps [0, 5] to [0, 2] with bitwise operations.

Why the complexity?

In PQC, we try to write algorithms which will be used millions of times per day on many devices. Saving runtime is crucial (though it should not be the only metric, see djb’s “Cryptographic competitions”).

I measured about 170 cycles with rdtsc with gcc 9.3 on x86_64 for immediate_mod3, about 95 cycles for mod3, and about 60 cycles for % 3.
With -O3, I measured 54, 34, and 38 cycles respectively.


In PQC, we also try to achieve leakage resilience. One basic building block is constant-time execution.

We cannot use % 3, because this will compile to some division (clang) or multiplication (gcc) assembly where the runtime depends on the argument.
Likewise, we cannot just use a conditional (if (r == 3) return 0) to map 3 to 0, because a branch on secret data is not constant time.

Conclusion

We analyzed a 12 LOC C function which is part of the reference implementation of one PQC scheme.
We discovered arithmetic operations to compute modulo 3.
We also discovered bitwise operations to map the value 3 to 0.
Finding properly engineered bits of software is fun!