Differences in the NTRU ref and pqm4 implementations

Motivation

In the NIST PQC Standardization Effort, generic implementations (“ref”) working on a wide range of platforms are desired. In the case of pqm4, we look for an optimized implementation for the ARM Cortex-M4 platform (“pqm4”).

What are the differences? What can be optimized?

A first note

The optimized implementation provided in the submission is equivalent to the reference implementation in the NIST submission.

The differences

To get the PQM4 implementation from the ntruhrss701 reference implementation, you need to apply the following changes:

Use uint8_t in the public API (and internally), not unsigned char* for arrays
Replace the crypto_hash_ suffix with sha3_
rewrite packing and unpacking (en-/decoding of keys from bytestring) in assembly
Put cmov.c’s cmov into verify.c. Furthermore verify.c features a verify function which is not actually called.
void owcpa_enc(unsigned char *c, const poly *r, const poly *m, const unsigned char *pk) is redefined as void owcpa_enc(unsigned char *c, const unsigned char *rm, const unsigned char *pk)
A Toom-Cook (k=4) multiplier in Assembly is provided, generated by python software accompanying the “Faster multiplication in ℤ_{2^m}[X]” paper. Thi implementation contains three functions
- schoolbook_11x11
- karatsuba_176x176
- polymul_asm
An NTT implementation in Assembly is provided as function void Good_mul_768(uint16_t *res_coeffs, const uint16_t *small_coeffs, const uint16_t *big_coeffs) and void poly_SignedZ3_Rq_mul(poly *r, const poly *a, const poly *b) as replacement for the schoolbook multiplier void poly_Rq_mul(poly *r, const poly *a, const poly *b).
MODQ/mod3 operations are often done at different places
For inversion, the Schroeppel–Orman–O’Malley–Spatscheck "Almost Inverse" algorithm as described by Silverman in NTRU Tech Report #14 is used (namely poly_R2_inv and poly_S3_inv)
sample_iid moved from a separate file to sample_iid.c
A function poly_Rq_mul_x_minus_1 is introduced. It applies polynomial reduction after multiplication R_q.

Conclusion

The most expensive parts of the implementation is computation of the inverse polynomial in a quotient ring as well as multiplication of polynomials. As one can see, they put considerable effort into optimization of these parts.