Motivation
In the NIST PQC Standardization Effort, generic implementations (“ref”) working on a wide range of platforms are desired. In the case of pqm4, we look for an optimized implementation for the ARM CortexM4 platform (“pqm4”).
What are the differences? What can be optimized?
A first note
The optimized implementation provided in the submission is equivalent to the reference implementation in the NIST submission.
The differences
To get the PQM4 implementation from the ntruhrss701 reference implementation, you need to apply the following changes:

Use uint8_t in the public API (and internally), not unsigned char* for arrays

Replace the
crypto_hash_
suffix withsha3_

rewrite packing and unpacking (en/decoding of keys from bytestring) in assembly

Put cmov.c’s cmov into verify.c. Furthermore verify.c features a verify function which is not actually called.

void owcpa_enc(unsigned char *c, const poly *r, const poly *m, const unsigned char *pk)
is redefined asvoid owcpa_enc(unsigned char *c, const unsigned char *rm, const unsigned char *pk)

A ToomCook (k=4) multiplier in Assembly is provided, generated by python software accompanying the “Faster multiplication in ℤ_{2^m}[X]” paper. Thi implementation contains three functions

schoolbook_11x11

karatsuba_176x176

polymul_asm


An NTT implementation in Assembly is provided as function
void Good_mul_768(uint16_t *res_coeffs, const uint16_t *small_coeffs, const uint16_t *big_coeffs)
andvoid poly_SignedZ3_Rq_mul(poly *r, const poly *a, const poly *b)
as replacement for the schoolbook multipliervoid poly_Rq_mul(poly *r, const poly *a, const poly *b)
. 
MODQ/mod3 operations are often done at different places

For inversion, the Schroeppel–Orman–O’Malley–Spatscheck "Almost Inverse" algorithm as described by Silverman in NTRU Tech Report #14 is used (namely
poly_R2_inv
andpoly_S3_inv
) 
sample_iid
moved from a separate file tosample_iid.c

A function
poly_Rq_mul_x_minus_1
is introduced. It applies polynomial reduction after multiplication R_q.
Conclusion
The most expensive parts of the implementation is computation of the inverse polynomial in a quotient ring as well as multiplication of polynomials. As one can see, they put considerable effort into optimization of these parts.