# Papers

During my time at university, I read a bunch of papers. Since the beginning of my PhD studies, I have systematically reviewed and summarized them. Here are my notes.

• § A Masked Ring-LWE Implementation
• § A Side-channel Resistant Implementation of SABER
• § Additively Homomorphic Ring-LWE Masking
• § Aggregated Private Information Retrieval
• § BasicBlocker: Redesigning ISAs to Eliminate Speculative-Execution Attacks
• § Cyclone: A safe dialect of C
• § Everything Old is New Again: Binary Security of WebAssembly
• § FastSpec: Scalable Generation and Detection of Spectre Gadgets Using Neural Emb…
• § High-speed Instruction-set Coprocessor for Lattice-based Key Encapsulation Mech…
• § Historical Notes on the Fast Fourier Transform
• § Number "Not" Used Once - Key Recovery Fault Attacks on LWE Based Lattice Crypto…
• § On the criteria to be used in decomposing systems into modules
• § PDF/A considered harmful for digital preservation
• § Piret and Quisquater’s DFA on AES Revisited
• § SEVurity: No Security Without Integrity
• § Software-based Power Side-Channel Attacks on x86
• § Templates vs. Stochastic Methods: A Performance Analysis for Side Channel Crypt…
• § The design of a Unicode font
• § Too Much Crypto
• § Tweaks and Keys for Block Ciphers: The TWEAKEY Framework
• § When a Patch is Not Enough - HardFails: Software-Exploitable Hardware Bugs

## Papers and notes

### A Masked Ring-LWE Implementation §

Title: “A Masked Ring-LWE Implementation” by Oscar Reparaz, Sujoy Sinha Roy, Frederik Vercauteren, Ingrid Verbauwhede [url] [dblp]
Published in 2015 at CHES 2015 and I read it in 2020/07
Abstract: Lattice-based cryptography has been proposed as a postquantum public-key cryptosystem. In this paper, we present a masked ring-LWE decryption implementation resistant to ﬁrst-order side-channel attacks. Our solution has the peculiarity that the entire computation is performed in the masked domain. This is achieved thanks to a new, bespoke masked decoder implementation. The output of the ring-LWE decryption are Boolean shares suitable for derivation of a symmetric key. We have implemented a hardware architecture of the masked ring-LWE processor on a Virtex-II FPGA, and have performed side channel analysis to conﬁrm the soundness of our approach. The area of the protected architecture is around 2000 LUTs, a 20 % increase with respect to the unprotected architecture. The protected implementation takes 7478 cycles to compute, which is only a factor ×2.6 larger than the unprotected implementation.

ambiguity

$\operatorname{th}(x) = \begin{cases} 0 & \text{if } x \in (0, q/4) \cup (3q/4, q) \\ 1 & \text{if } x \in (q/4, 3q/4) \end{cases}$

The intervals apparently should denote inclusive-exclusive boundaries.
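A minimal Python sketch of this threshold function, assuming the inclusive-exclusive boundaries noted above (the modulus 12289 in the example is just a common ring-LWE choice, not necessarily the paper's):

```python
def th(x: int, q: int) -> int:
    """Threshold decoder for a single ring-LWE coefficient.

    Maps x in [0, q) to a message bit: coefficients near q/2 decode to 1,
    coefficients near 0 (or q) decode to 0. Interval boundaries are taken
    as inclusive-exclusive, as noted above.
    """
    return 1 if q // 4 <= x < 3 * q // 4 else 0

# Values near q/2 decode to 1, values near 0 or q decode to 0.
assert th(0, 12289) == 0
assert th(6144, 12289) == 1
```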

error

On page 686, c1 and c2 are swapped erroneously throughout. On page 685, c2 is defined such that it contains the message. Consequently, the factor r2 must be applied to c1, not c2. However, page 686 repeatedly multiplies c2 with the different r values.

First line is meant to be read as “If a' is in quadrant (1) and a'' is in quadrant (1), then a is inconclusive (∅), rather than bit 0 or 1”.

| a' | a'' | a |
|----|-----|---|
| 1 | 1 | ∅ |
| 1 | 2 | 1 |
| 1 | 3 | ∅ |
| 1 | 4 | 0 |
| 2 | 1 | 1 |
| 2 | 2 | ∅ |
| 2 | 3 | 0 |
| 2 | 4 | ∅ |
| 3 | 1 | ∅ |
| 3 | 2 | 0 |
| 3 | 3 | ∅ |
| 3 | 4 | 1 |
| 4 | 1 | 0 |
| 4 | 2 | ∅ |
| 4 | 3 | 1 |
| 4 | 4 | ∅ |
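The rule table above translates directly into a lookup; a sketch (the quadrant numbering 1–4 over [0, q) and the use of None for ∅ are my assumptions):

```python
# Masked decoder rule table from above: given the quadrants (1..4) of the
# two shares a' and a'', either deduce th(a) or return None (inconclusive).
RULES = {
    (1, 1): None, (1, 2): 1,    (1, 3): None, (1, 4): 0,
    (2, 1): 1,    (2, 2): None, (2, 3): 0,    (2, 4): None,
    (3, 1): None, (3, 2): 0,    (3, 3): None, (3, 4): 1,
    (4, 1): 0,    (4, 2): None, (4, 3): 1,    (4, 4): None,
}

def quadrant(x: int, q: int) -> int:
    """Quadrant 1..4 of x in [0, q)."""
    return 4 * x // q + 1

def masked_th_rule(a1: int, a2: int, q: int):
    """Try to deduce th(a1 + a2 mod q) from the shares' quadrants alone."""
    return RULES[(quadrant(a1, q), quadrant(a2, q))]
```

For example, shares in quadrants (1, 2) always sum to a value in [q/4, 3q/4), hence the deterministic bit 1; the diagonal pairs span both decision regions and stay inconclusive.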

quotes

• “The area of the protected architecture is around 2000 LUTs, a 20 % increase with respect to the unprotected architecture. The protected implementation takes 7478 cycles to compute, which is only a factor ×2.6 larger than the unprotected implementation.”
• “So far, the reported implementations have focused mainly on efficient implementation strategies, and very little research work has appeared in the area of side channel security of the lattice-based schemes.”
• “Most notably, masking is both a provably sound and popular in industry.”
• “However, there are not many masking schemes specifically designed for postquantum cryptography. In Brenner et al. present a masked FPGA implementation of the post-quantum pseudo-random function SPRING.”
• “In the rest of the paper, we focus on protecting the ring-LWE decryption operation against side-channel attacks with masking. The decryption algorithm is considerably exposed to DPA attacks since it repeatedly uses long-term private keys. In contrast, the encryption or key-generation procedures use ephemeral secrets only”
• “We analyze the error rates of the decryption operation in Sect. 6 and apply error correcting codes.”
• “the message is first lifted to a ring element m̄ ∈ R q by multiplying the messagebits by q/2.”
• “The most natural way to split the computation of the decryption as Eq. 2 is to split the secret polynomial r additively into two shares r' and r'' such that r[i] = r'[i] + r''[i] (mod q) for all i.”
• “The final threshold th(·) operation of Eq. 2 is obviously non-linear in the base field Fq, and hence cannot be independently applied to each branch”
• “For instance, in [4] an approach based on masked tables was used.”
• “We design a bespoke masked decoder that results in a very compact implementation.”
• “a denotes a single coefficient and (a', a'') its shares such that a' + a'' = a (mod q).”
• “In roughly half of the cases, we can apply one of the 8 rules previously described to deduce the value of th(a).”
• “a' ← a' + Δ1 and a'' ← a'' − Δ1 for certain Δ1
• “See the extended version of this paper for exemplary values of Δi.” → http://www.reparaz.net/oscar/ches2015-lwe → 404 Not Found
• Masked Table Lookup: “This is a well-studied problem that arises in other situations (for instance, when masking the sbox lookup in a typical block cipher) and there are plenty of approaches here to implement such masked table lookup. We opted for the approach of masked tables as in [26].”
• “The usual precautions are applied when implementing f. For our target FPGA platform, we carefully split the 7-bit input to 1-bit output function f into a balanced tree of 4-bit input LUTs, in such a way that any intermediate input or output of LUTs does not leak in the first order.”
• “In Table 1, we can see that the proposed masking of the ring-LWE architecture incurs an additional area overhead of only 301 LUTs and 129 FFs in comparison to the unprotected version.”
• “we could straightforward reduce the additional area cost by reusing the 13-bit addition and subtraction circuits present in the arithmetic coprocessor. […] For simplicity, we did not implement this approach.”
• “Thus in total, a masked decryption operation requires 7478 cycles. The arithmetic coprocessor and the masked decoder run in constant time and constant flow.”
• “We point out that the approach laid out in Sect. 3 scales quite well with the security order. To achieve security at level d+1, one would need to split the computation of Eq. 2 into d branches analogously to Eq. 3.”
• “We provide a very advantageous setting for the adversary: we assume that the evaluator knows the details about the implementation (for example, pipeline stages).”
• “The evaluation methodology to test if the masking is sound is as follows. We first proceed with first-order key-recovery attacks when the randomness source (PRNG) is switched off. […] Then we switch on the PRNG and repeat the attacks. If the masking is sound, the first-order attacks shall not succeed. In addition, we perform second-order attacks to confirm that the previous first-order analyses were carried out with enough traces.”
• “We can see that starting from ≈2000 measurements this second-order attack is successful.”
• “We remark that the relatively low number of traces required for the second-order attack is due to the very friendly scenario for the evaluator. The platform is low noise and no other countermeasure except than masking was implemented.”
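The iterative refresh described in the quotes above (a' ← a' + Δ1, a'' ← a'' − Δ1, combined with the quadrant rules) can be sketched as follows. Drawing Δ uniformly at random is a simplification; the paper uses specific Δi values, and the extended version that lists them is no longer available:

```python
import random

def quadrant(x: int, q: int) -> int:
    """Quadrant 1..4 of x in [0, q)."""
    return 4 * x // q + 1

# Pairs of share quadrants from which th(a) can be deduced (see table above).
CONCLUSIVE = {(1, 2): 1, (2, 1): 1, (3, 4): 1, (4, 3): 1,
              (1, 4): 0, (4, 1): 0, (2, 3): 0, (3, 2): 0}

def masked_decode(a1: int, a2: int, q: int, max_iter: int = 16):
    """Deduce th(a1 + a2 mod q) without ever recombining the shares.

    In each iteration the shares are re-randomized by a Delta that cancels
    in the sum, until their quadrant pair becomes conclusive. N = 16
    iterations matches the paper's choice; the residual inconclusive
    cases are the source of the increased decryption error rate.
    """
    for _ in range(max_iter):
        rule = CONCLUSIVE.get((quadrant(a1, q), quadrant(a2, q)))
        if rule is not None:
            return rule
        delta = random.randrange(q)       # paper: specific Delta_i values
        a1 = (a1 + delta) % q             # a'  <- a'  + Delta
        a2 = (a2 - delta) % q             # a'' <- a'' - Delta, sum unchanged
    return None  # still inconclusive after N iterations
```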

summary

Nice paper. It is sad that c1 and c2 got mixed up on page 686. The masking idea is indeed natural: r2 = r2' + r2'' (mod q), and for the decoder a' := a' + Δ1 and a'' := a'' - Δ1. Isn't sampling also a problem when masking ring-LWE? If so, the title is inappropriate and should be “A Masked Ring-LWE Decoder Implementation”. The described implementation is only CPA-secure. It incurs a factor of 2.679 in CPU cycles and 3.2 in runtime (microseconds) in the decryption step. With N=16 (the maximum number of iterations in the decoder), you get 1 error in about 400 000 bits. This means in NewHope's 256 bit encapsulated keys, you get 1 error in 1420 messages. For a long time I did not understand why the masked decoder requires the value of the previous iteration (loop in Figure 3); eventually it clicked. I don't like the fact that the source code was not published with the paper (→ reproducible research).

typo

• “In our implementation, N = 16 iterations produces a satisfactory result.”
• “essentially maps the output of each quadrant qi' and qi'' (2 bits each) after the i-the iteration”

### A Side-channel Resistant Implementation of SABER §

Title: “A Side-channel Resistant Implementation of SABER” by Michiel Van Beirendonck, Jan-Pieter D'Anvers, Angshuman Karmakar, Josep Balasch, Ingrid Verbauwhede [url] [dblp]
Published in 2020 at IACR eprint 2020 and I read it in 2020-11

open questions

• “Saber.Masked.KEM.Decaps does not require multiplication of two masked polynomials, which is a significantly more expensive computation.”
It doesn't?
• “A possible countermeasure is to randomize the order of execution of these vulnerable routines. Randomness should be used to shuffle the order of operations in Saber’s multiplication or introduce dummy operations.”
Is there an actual implementation?

quotes

• “This work describes a side-channel resistant instance of Saber, one of the lattice-based candidates, using masking as a countermeasure.”
• “NIST has already announced that, in the second round, more stress will be put on implementation aspects.”
• “The security of both problems relies on introducing noise into a linear equation. However, in LWE-based schemes the noise is explicitly generated and added to the equation, while the LWR problem introduces noise through rounding of some least significant bits.”
• “In our masked implementation of Saber, we develop a novel primitive to perform masked logical shifting on arithmetic shares.”
• “Furthermore, Saber avoids excessive noise sampling due to its choice for LWR.”
• “We integrate and profile our masked CCA-secure decapsulation in the PQM4 [KRSS] post-quantum benchmark suite for the Cortex-M4, showing our close-to-ideal 2.5x overhead in CPU cycles. This factor can directly be compared to the overhead factor 5.7x reported by Oder et al., which is the work most closely related to ours, and we show that it can largely be attributed to the masking-friendly design choices of Saber.”
• “We say that a PKE is δ-correct if P[Decrypt(sk, ct) ≠ m : ct ← Encrypt(pk, m)] ≤ δ.” (i.e. smaller is better)
• “The Saber package is based on the Module Learning With Rounding (MLWR) problem, and its security can be reduced to the security of this problem. MLWR is a variant of the well known Learning With Errors (LWE) problem [Reg04], which combines a module structure as introduced by Langlois and Stehlé [LS15] with the introduction of noise through rounding as proposed by Banerjee et al. [BPR12].”
• “The additions with the constant terms h1, h2 and h are needed to center the errors introduced by rounding around 0, which reduce the failure probability of the protocol.”
• “The reason being that even without side-channel information, the Saber.PKE is vulnerable to chosen-ciphertext attacks if the secret key is re-used, which was shown by Fluhrer [Flu16] to be the case for all current LWE-based and LWR-based IND-CPA secure encryption schemes.”
• “Several secure A2B as well as B2A conversion algorithms exist. These generally come in two flavours, depending on whether the arithmetic shares use a power-of-two or a prime modulus. The former group have received considerably more research interest due to their use in symmetric primitives, and they are typically more efficient and simpler to implement.”
• “Finally, Schneider et al. [SPOG19] combine the previous two algorithms, and at the same time present a new algorithm, B2Aq, which works for arbitrary moduli as well as arbitrary security orders. However, when instantiated as a power-of-two conversion, e.g. q = 2⁸, B2Aq only outperforms [BCZ18] and [CGV14] for more than nine shares.”
• “In the remainder of this section, we first describe the Coron-Tchulkine [CT03] table-based A2B algorithm, including the fix from [Deb12].”
• “Another similarity we share with [KMRV18] is the use of the ARM Cortex-M4’s support for SIMD instructions to speed up execution.”
• “We use the Test Vector Leakage Assessment (TVLA) methodology introduced by Goodwill et al. [GJJR11] in order to validate the security of our implementation.”
• “In our experiments we use a non-specific fix vs. random test.”
• “TVLA uses the Welch’s t-test to detect differences in the mean power consumption between the two sets.”
• “After 100 000 measurements, our t-test results for Saber.Masked.KEM.Decaps with masks ON still show some slight excursions past the ±4.5 confidence boundary. This is sometimes expected for long traces, and therefore, as per [GJJR11], we conduct a second independent t-test showing that these excursions are never at the same time instant.”
• “Our most efficient design is (D), where both A2A tables and the SecBitSlicedSampler are implemented.”
• “Masking has so far received limited attention in post-quantum cryptography, but will become increasingly important in the continuation of the NIST standardization process.”
• “Oder et al. do not present the dynamic memory consumption for an unmasked design, such that we only make the masked comparison for that performance metric.”
• “From Table 5, these two A2A conversions take roughly 60,000 CPU cycles, whereas masked sampling of four error polynomials from β μ would take approximately 1,026,000 CPU cycles. The high cost of masked binomial sampling is further illustrated in [OSPG18] (Table 2), where roughly 71% of the decapsulation’s CPU cycles are spent in the masked sampling routine.”
• “A possible countermeasure is to randomize the order of execution of these vulnerable routines. Randomness should be used to shuffle the order of operations in Saber’s multiplication or introduce dummy operations.”

summary

• Table 2 is an illustration for ℤ4
• “From this table, it can be seen that the linear operations, i.e. polynomial arithmetic, have roughly a factor 2x overhead in the masked design, due to the duplication of every polynomial multiplication. Non-linear operations, on the other hand, have overhead factors ranging from 7x for A2A conversion to 23x for binomial sampling. Our design requires 5048 random bytes, and spends roughly 100,000 cycles sampling these from the TRNG”
• Saber.Masked.PKE.Dec takes 1.96 times as long in the masked version
• Saber.Masked.PKE.Enc takes 2.48 times as long in the masked version
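The roughly 2x overhead for linear operations follows directly from share-wise computation: a public polynomial multiplied with a masked secret is simply multiplied with each share. A toy sketch (schoolbook negacyclic multiplication with a tiny ring degree, not Saber's actual Toom-Cook/Karatsuba arithmetic):

```python
import random

Q = 2**13   # Saber's power-of-two ciphertext modulus
N = 4       # toy ring degree; Saber uses N = 256

def poly_mul(a, b):
    """Schoolbook multiplication in Z_Q[X]/(X^N + 1)."""
    res = [0] * N
    for i in range(N):
        for j in range(N):
            k = (i + j) % N
            sign = -1 if i + j >= N else 1   # negacyclic wrap-around
            res[k] = (res[k] + sign * a[i] * b[j]) % Q
    return res

def split(s):
    """First-order arithmetic masking: s = s1 + s2 (mod Q), coefficient-wise."""
    s1 = [random.randrange(Q) for _ in range(N)]
    s2 = [(c - c1) % Q for c, c1 in zip(s, s1)]
    return s1, s2

# Linear operations act share-wise: multiplying a public polynomial `a`
# with a masked secret just duplicates the multiplication (~2x overhead).
a = [random.randrange(Q) for _ in range(N)]
s = [random.randrange(Q) for _ in range(N)]
s1, s2 = split(s)
p1, p2 = poly_mul(a, s1), poly_mul(a, s2)
assert [(x + y) % Q for x, y in zip(p1, p2)] == poly_mul(a, s)
```

Non-linear steps (the logical shift that implements LWR rounding, the binomial sampler) cannot be duplicated this way, which is where the 7x–23x overhead factors come from.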

typo

• Symmetric crypto primitive names are printed in math mode, making the kerning terrible.
• “Where Reparaz et al. successfully masked a Chosen-Plaintext Attack (CPA)-secure RLWE decryption, real-world applications typically require Chosen-Ciphertext Attack (CCA) secure primitives, which can be obtained using an appropriate CCA-transform.”
“Whereas Reparaz et al. successfully masked a Chosen-Plaintext Attack (CPA)-secure RLWE decryption, real-world applications typically require Chosen-Ciphertext Attack (CCA) secure primitives, which can be obtained using an appropriate CCA-transform.”
• “A first-order masking splits any sensitive variable x in the algorithm into two shares x1 and x2, such that x = x1 + x2, and perform all operations in the algorithm on the shares separately.”
“A first-order masking splits any sensitive variable x in the algorithm into two shares x1 and x2, such that x = x1 + x2, and performs all operations in the algorithm on the shares separately.”
• “In grey the operations that are influenced by the long term secret s and thus vulnerable to side-channel attacks.”
“In grey are operations that are influenced by the long term secret s and thus vulnerable to side-channel attacks.”
• “Even though previous attacks have focused on schoolbook mulitplication …”
“Even though previous attacks have focused on schoolbook multiplication …”

### Additively Homomorphic Ring-LWE Masking §

Title: “Additively Homomorphic Ring-LWE Masking” by Oscar Reparaz, Ruan de Clercq, Sujoy Sinha Roy, Frederik Vercauteren, Ingrid Verbauwhede [url] [dblp]
Published in 2016 at PQCrypto 2016 and I read it in 2020-07

quotes

The most important statement is equation 3:

decryption(c1, c2) ⊕ decryption(c1’, c2’) = decryption(c1 + c1’, c2 + c2’)

• “we do not require a masked decoder but work with a conventional, unmasked decoder.”
• “Masking [CJRR99,GP99] is a provable sound countermeasure against DPA.”
• “A caveat of our approach is that we need to place additional assumptions on the underlying arithmetic hardware compared to the CHES 2015 approach.”
• “The operation ⊕ is the xor operation on bits or strings of bits.”
• “In the literature there are several encryption schemes based on the ring-LWE problem, for example [LPR10,FV12,BLLN13] etc.”
• “Among all the computations, polynomial multiplication is the costliest. Most of the reported implementations use the Number Theoretic Transform (NTT) to accelerate the polynomial multiplications.”
• “The proposed randomized decryption. To perform the decryption of (c1, c2) in a randomized way, the implementation follows the following steps:
1. Internally generate a random message m’ unknown to the adversary
2. Encrypt m’ to (c1’, c2’)
3. Perform decryption(c1 + c1’, c2 + c2’) to recover m ⊕ m’.
4. The masked recovered message is the tuple (m’, m ⊕ m’).”
• “This approach has the nice property of not requiring a masked decoder.”
• “The obvious disadvantage is that extra circuitry or code is required to perform the encryption. Another disadvantage is the increased decryption failure rate. When two ciphertexts are added, the amount of noise increases. The added noise increases the decryption failure rate as we will see in Sect. 4.3.”
• “Our countermeasure can be thought of as ciphertext blinding.”
• “Thus, straightforward first-order DPA attack does not immediately apply. Nevertheless, more refined first-order DPA attacks do apply.”
•  “In Appendix A we describe a strategy to detect whether s[i] = 0 or s[i] ≠ 0, which leads to an entropy loss.”
• “after all, Eq. 3 may seem to imply that the decoding function is linear. However, this is clearly not the case.”
• “When the masking is turned off, the decryption failure rate is 3.6 × 10⁻⁵ per bit. The failure rate increases to 3.3 × 10⁻³ per bit when the masking turned on.”
• “In terms of speed, the costliest process is the encryption. It is 2.8 times slower than the decryption.”

summary

A paper of rather low quality. The core essence is Equality 3: decryption(c1, c2) ⊕ decryption(c1’, c2’) = decryption(c1 + c1’, c2 + c2’). Besides that, there are many confusing statements. The workings of Equality 3 are barely explained (i.e. its correctness is IMHO insufficiently discussed), but should be understandable for everyone in the field. Correctness is non-obvious, because we have an addition on the RHS and an XOR on the LHS. But it is trivial once you note that the RHS works in ℤq whereas the LHS works in ℤ2, and XOR is addition in ℤ2. Unlike the CHES 2015 paper, no masked decoder circuitry is needed (though encryption circuitry must be available during decryption), which makes it a super interesting approach. The failure rate increases by a factor of 92 (3.6 × 10⁻⁵ → 3.3 × 10⁻³ per bit), which is a difficult problem of its own, but present in the CHES 2015 approach as well.

In summary, the approach computes the encryption of the original message (⇒ (c1, c2)) and also the encryption of some random value (⇒ (c1’, c2’)). Then, we don't decrypt dec(c1, c2), but dec(c1 + c1’, c2 + c2’).
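Equality 3 can be checked on a toy LWE instance. The following sketch uses a single-coefficient (non-ring) simplification with made-up parameters, so it only illustrates why additively masking ciphertexts yields an XOR-masked message:

```python
import random

q = 12289  # toy modulus; any q with enough noise margin works

def th(x):
    """Threshold decoder: 1 for values near q/2, else 0."""
    return 1 if q // 4 <= x % q < 3 * q // 4 else 0

def encrypt(m, s, noise=20):
    """Toy one-coefficient LWE encryption of bit m under secret s."""
    a = random.randrange(q)
    e = random.randrange(-noise, noise + 1)
    b = (a * s + e + m * (q // 2)) % q
    return a, b

def decrypt(c, s):
    a, b = c
    return th((b - a * s) % q)

s = random.randrange(q)
m = 1
c = encrypt(m, s)                       # ciphertext of the real message
m_rand = random.randrange(2)
c_rand = encrypt(m_rand, s)             # ciphertext of a random mask m'
c_sum = ((c[0] + c_rand[0]) % q, (c[1] + c_rand[1]) % q)

# Equality 3: dec(c) XOR dec(c') == dec(c + c'), as long as the summed
# noise stays small; the masked recovered message is (m', m XOR m').
assert decrypt(c_sum, s) == m ^ m_rand
```

Note how the increased failure rate falls out of this picture for free: the summed ciphertext carries the noise of both encryptions.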

• “A caveat of our approach is that we need to place additional assumptions on the underlying arithmetic hardware compared to the CHES 2015 approach.”
• which ones? performance assumption on encryption?
• “Thus, straightforward first-order DPA attack does not immediately apply. Nevertheless, more refined first-order DPA attacks do apply.”
• “In particular, the practitioner should pay careful attention to leaking distances if implemented in software, since during the masked decoding both shares are handled in contiguous temporal locations.”
• undefined terminology “distances”
• “are handled in contiguous temporal locations” is not necessarily true (only likely true)
• “after all, Eq. 3 may seem to imply that the decoding function is linear. However, this is clearly not the case.”
• elaboration would be nice
• decryption failure is inherent to ring-LWE
• essentially, it is close to linearity
• but decryption is non-linear due to error
• and even neglecting the error as constant, the decryption failure is skewed here and thus linearity is not really given
• Figure 2 is incomprehensible and axes insufficiently explained
• Appendix A is vacuous. How can it be done? I guess the only statement is “it can be classified”, which is a trivial statement (considering template attacks) which does not need more justifications.

typo

• “The countermeasure makes harder the DPA attack” → “The countermeasure makes the DPA attack harder”
• “Note that the distribution of […] when s = 0 and […] is uniform random is different from the distribution of […] when s = 0.” → “Note that the distribution of […] when s = 0 and […] is uniform random but is different from the distribution of […] when s = 0.”

### Aggregated Private Information Retrieval §

Title: “Aggregated Private Information Retrieval” by Lukas Helminger, Daniel Kales, Christian Rechberger [url] [dblp]
Published in 2020-05 at and I read it in 2020-07
Abstract: With the outbreak of the coronavirus, governments rely more and more on location data shared by European mobile network operators to monitor the advancements of the disease. In order to comply with often strict privacy requirements, this location data, however, has to be anonymized, limiting its usefulness for making statements about a ﬁltered part of the population, like already infected people.

quotes

• “In this research, we aim to assist with the disease tracking efforts by designing a protocol to detect coronavirus hotspots from mobile data while still maintaining compliance with privacy expectations.”
• “Governments in Italy, Germany, and Austria are relying on this metadata to monitor how people are complying with stay-at-home orders.”
• “we design a specialized private information retrieval (PIR) protocol.”
• “In this paper, we are interested in the single-server variants, namely computational PIR (CPIR), which rely on cryptographic hardness assumptions to hide the query from the server. Recent work has heavily improved on the original ideas of Chor et al. Many CPIR implementations use homomorphic encryption (HE) to hide the queries from the server while still allowing him to perform operations on the query.”
• “HE is a cryptographic primitive that allows performing computations on encrypted data without knowing the secret decryption key.”
• “We assume without loss of generality that the first column of the database of the server consists of unique identifiers.”
• “The threat model of the APIR protocol is similar to PIR protocols, i.e., the server should not know which elements were retrieved by the client.”
• “To prevent the client from learning individual entries, we make sure that the client’s list of identifiers has a guaranteed minimum cardinality and that each identifier is unique.”
• “We report multithreaded runtimes of 30 minutes for the standard APIR protocol, and 1 hour when extra steps are applied to ensure the input vector is not malicious.”
• “The RSA encryption scheme is homomorphic for multiplication and Paillier’s cryptosystem is homomorphic for addition. However, it was not until Gentry’s groundbreaking work from 2009 that we were able to construct the first fully homomorphic encryption (FHE) scheme, a scheme which in theory can evaluate an arbitrary circuit on encrypted data. His construction is based on ideal lattices and is deemed to be too impractical ever to be used, but it led the way to construct more efficient schemes in many following publications”
• “Once this noise becomes too large and exceeds a threshold, the ciphertext cannot be decrypted correctly anymore. We call such a scheme a somewhat homomorphic encryption scheme (SHE), a scheme that allows evaluating an arbitrary circuit over encrypted data up to a certain depth.”
• “In his work, Gentry introduced the novel bootstrapping technique, a procedure that reduces the noise in a ciphertext and can turn a (bootstrappable) SHE scheme into an FHE scheme.”
• “In many practical applications it is, therefore, faster to omit bootstrapping and choose a SHE scheme with large enough parameters to evaluate the desired circuit.”
• “two different types of PIR: information theoretic PIR (IT-PIR) protocols which rely on multiple, non-colluding servers to ensure privacy; and computational PIR (CPIR) where a single server manages the database and encryption is used to hide the query.”
• “An established technique to achieve ε-differential privacy is the Laplace mechanism, i.e., to add Laplacian noise to the final result of the computation.”
• “H should not learn the movement pattern of any individual. More concretely, H is not allowed to query the location data for less than W different people. W has to be chosen in such a way that the data aggregation provides anonymity and its exact value will highly depend on the actual underlying data, which is the reason why we do not give a generic value in this paper.”
• “First, M could try to find the place of residence by assuming people sleep at home.”
• […] “remove those places for the heat map creation.”
• “The computationally most expensive phase in the protocol is the Data Aggregation phase, in which the server multiplies a huge matrix to a homomorphically encrypted input vector.”

summary

The paper implements a nice idea. The usefulness of the use case needs to be discussed with public health experts (how does it help to know that many infected people live in this block of houses?). However, I have been told the entire paper was written in one month, which is quite impressive considering the technical depth in the field of homomorphic encryption.

There are many typos, and to me the main purpose of the protocol in Figure 1 was not comprehensible before talking to the authors. I also didn't understand in which ways ε-differential privacy is desirable, how it can be ensured, or which definition of “vector c is binary” they used before going into the details in section 3.2. Apparently, a binary vector is desirable to prevent leakage. For Figure 2, they used the TeX package cryptocode to illustrate the protocol; to the best of my understanding, it is just a reiteration of Figure 1. On page 13, the paragraph “Note that, as we have already mentioned in Section 3.6” should be moved to the concluding remarks. On page 14, it is unclear what “isolated profiles” are. I didn't go through the details of section 5.
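As I understand it, the core of the protocol is a matrix-vector product between the server's database and a client-supplied binary selection vector, which in the real protocol is homomorphically encrypted so the server never learns the selection. This plaintext sketch (names and the threshold value are made up) only illustrates the aggregation and the minimum-cardinality check W:

```python
W = 2  # minimum number of selected identifiers (anonymity threshold)

def aggregate(database, selection):
    """Sum the rows of `database` picked by the binary `selection` vector.

    In the real protocol the selection vector is encrypted under HE and the
    binarity/cardinality properties are enforced cryptographically; here
    they are plain assertions.
    """
    assert all(bit in (0, 1) for bit in selection), "vector must be binary"
    assert sum(selection) >= W, "fewer than W identifiers selected"
    cols = len(database[0])
    result = [0] * cols
    for bit, row in zip(selection, database):
        if bit:
            result = [r + v for r, v in zip(result, row)]
    return result

# Each row: location counts of one identifier across, say, 3 regions.
db = [[1, 0, 2],
      [0, 3, 1],
      [5, 1, 0]]
assert aggregate(db, [1, 0, 1]) == [6, 1, 2]
```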

### BasicBlocker: Redesigning ISAs to Eliminate Speculativ… §

Title: “BasicBlocker: Redesigning ISAs to Eliminate Speculative-Execution Attacks” by Jan Philipp Thoma, Jakob Feldtkeller, Markus Krausz, Tim Güneysu, Daniel J. Bernstein [url] [dblp]
Published in 2020-07 at and I read it in 2020-08
Abstract: Recent research has revealed an ever-growing class of microarchitectural attacks that exploit speculative execution, a standard feature in modern processors. Proposed and deployed countermeasures involve a variety of compiler updates, ﬁrmware updates, and hardware updates. None of the deployed countermeasures have convincing security arguments, and many of them have already been broken.

Comment: Preprint

quotes

• “The obvious way to simplify the analysis of speculative-execution attacks is to eliminate speculative execution. This is normally dismissed as being unacceptably expensive, but the underlying cost analyses consider only software written for current instruction-set architectures, so they do not rule out the possibility of a new instruction-set architecture providing acceptable performance without speculative execution.”
• “The IBM Stretch computer in 1961 automatically speculated that a conditional branch would not be taken: it began executing instructions after the conditional branch, and rolled the instructions back if it turned out that the conditional branch was taken.”
• “Software analyses in the 1980s such as [12] reported that programs branched every 4–6 instructions.”
• “The penalty for mispredictions grew past 10 cycles. Meanwhile the average number of instructions per cycle grew past two, so the cost of each mispredicted branch was more than 20 instructions.”
• “The P-cycle branch-misprediction cost is the time from early in the pipeline, when instructions are fetched, to late in the pipeline, when a branch instruction computes the next program counter.”
• “A branch delay slot means that a branch takes effect only after the next instruction.”
• “The recent paper [49] introduces a RISC-V extension to flush microarchitectural state and shows that the extension stops several covert channels.”
• “Another approach to ISA modifications against transient-execution attacks is to explicitly tell the CPU which values are secret, and to limit the microarchitectural operations that can be carried out on secret values [35, 50, 51].”
• “The standard separation of fetch from decode also means that every instruction is being speculatively fetched.”
• “Program code can be divided into basic blocks. To do so, all possible control flow paths are mapped into a directed graph called Control Flow Graph (CFG) so that each edge of the CFG represents a control flow from one instruction to the next. If two vertices A and B are connected by a single edge (A → B) with A having only a single outbound edge and B having only a single inbound edge, the vertices are merged if the two instructions are sequential in memory.”
• “The microarchitectural state of a CPU is affected only by instructions that will eventually be retired.”
• “The most important implication of this goal is that the CPU must abandon any speculative behavior. This eliminates a major source of complexity inside the security analysis of modern CPUs.”
• “Within this basic block, the CPU is allowed to fast-fetch instructions, knowing that upcoming instructions can be found in a sequential order in memory and will definitely be executed. That is, since per definition, within the basic block, no control flow changes can occur. The instruction further provides information whether the basic block is sequential, stating that the control flow continues with the next basic block in the sequence in memory. If a basic block does not contain a control-flow instruction it is therefore sequential.”
• “We also modified the behavior of existing control-flow instructions, such as bne, j and jlre.”
• “If the current basic block does not contain a control-flow instruction, which is indicated by the sequential flag of the bb instruction, the CPU can fetch the next bb instruction directly.”
• “A processor supporting the bb instruction is required to have an instruction counter IC, a target register T , a branch flag B, and an exception flag E, all initialized to 0 on processor reset and used only as defined below.”
• “We can seamlessly support hardware loop counters in our design concept. One new instruction (lcnt) is necessary to store the number of loop iterations into a dedicated register.”
• “BasicBlocker can also be used as a form of coarse-grained CFI [1,2], as it only allows control flow changes to beginnings of basic blocks, indicated by a bb instruction. This reduces the prospects of success for (JIT-)ROP [36, 37] attacks, as the variety of potential gadgets is reduced.”
• “One could easily extend our design to larger, more complex CPUs.”
• “The more parallel pipelines a CPU has, the more important it gets to build software with large basic blocks, as small basic blocks with branch dependent instructions cause a higher performance loss on a multi-issue CPU compared to a single-issue CPU.”
• “It would also be possible to integrate our solution into a secure enclave by providing a modified fetch unit for the enclave.”
• “The bb instruction does not fit into any of the existing RISC-V instruction types so that we define a new instruction type to achieve an optimal utilization of the instruction bits (Figure 6).”
• “We used a configuration with five stages (IF, ID, EX, MEM, WB) and 4096 byte, one-way instruction- and data cache.”
• “Our compiler is based on the LLVM [28] Compiler Framework version 10.0.0, where we modified the RISC-V backend by introducing our ISA extension and inserting new compilation passes at the very end of the compilation pipeline to not interfere with other passes that do not support our new instructions.”
• “Linker relaxation, however, is one optimization that could reduce the number of instructions by substituting calls with a short jumping distance by a single jump instruction instead of two instructions (auipc and jalr).”
• “Highly optimized code, such as cryptographic libraries, are barely affected by branch prediction at all.”
• “Throughout our benchmarks, the average code size overhead was 17%.”
• “In some cases, BasicBlocker even outperforms sophisticated branch predictors.”

summary

Quite good paper.

“Preventing speculative execution exploits” conjured up more sophisticated expectations on my part, but in general the idea is legitimate and the research was done properly. Essentially, the additional bb instruction annotates how many of the following instructions do not contain control flow instructions, which allows a processor to prefetch them without any speculative considerations. An additional lcnt instruction for RISC-V handles loops.

In the reading group it was criticized that the formal definitions were cumbersome and inappropriate, but I consider it a strong suit that the idea was not only presented, but formalized. I think the negative aspects are that some statements are not properly attributed; Observation 1, for example, is not a contribution of this paper, but prior work. Furthermore, statements like “One could easily extend our design to larger, more complex CPUs” seem unjustified. Intel Skylake CPUs are built with 19-stage pipelines, which certainly show different performance metrics, so more research into the relation between the number of pipeline stages and performance is required. A new instruction type for RISC-V is also a non-trivial modification in practice. The corresponding code is not yet published, but the paper was released just one month prior to reading. Another weak point is the selling, which focuses on the good performance of BasicBlocker in the nettle-aes and nettle-sha benchmarks. As cryptographic applications, these obviously do not rely on branching except for “for loops”, which is completely non-representative, yet statements like “In some cases, BasicBlocker even outperforms sophisticated branch predictors” can be found. Finally, Claudio pointed out very well that JIT compilation cannot be implemented with BasicBlocker, since the number of instructions of a block is most likely not known ahead of time.

Overall, a quite good paper, but less aggressive selling would have increased reputation from my perspective.

• A branch delay slot executes the instruction placed after a jump regardless of the branch outcome; the compiler fills the slot with an instruction that has no data dependency on the branch. This concept is older than out-of-order execution.
• Definitions (page 4):
• Definition 1: Microarchitectural Effects
• Definition 2: Retired Instructions
• Definition 3: Instruction Stream
• Definition 4: Control Flow Instructions
• Definition 5: Basic Block
• Definition 6: t-security
• Definitions (page 5 & 7):
• Definition 7: Hardware secure processor
• Definition 8: BB Instruction
• Definition 9: BB Delayed Branches
• Definition 10: BB Exceptions
• Definition 11: BB Required
• Definition 12: BB Prefetching
• “Obviously, BasicBlocker slightly increases the code size as every basic block is extended by an additional instruction” with “average code size overhead was 17%”
• Gem5 project: “The gem5 simulator is a modular platform for computer-system architecture research, encompassing system-level architecture as well as processor microarchitecture. gem5 is a community led project with an open governance model”
• https://en.wikichip.org/wiki/WikiChip

### Cyclone: A safe dialect of C §

Title: “Cyclone: A safe dialect of C” by Greg Morrisett, James Cheney, Dan Grossman, Michael Hicks, Yanling Wang [url] [dblp]
Published in 2002 at USENIX 2002 and I read it in 2020-02
Abstract: Cyclone is a safe dialect of C. It has been designed from the ground up to prevent the buffer overflows, format string attacks, and memory management errors that are common in C programs, while retaining C’s syntax and semantics. This paper examines safety violations enabled by C’s design, and shows how Cyclone avoids them, without giving up C’s hallmark control over low-level details such as data representation and memory management.

Ambiguity

“Arrays and strings are converted to ?-pointers as necessary (automatically by the compiler).”

→ When exactly?

Example

“We don’t consider it an error if non-pointers are uninitialized. For example, if you declare a local array of non-pointers, you can use it without initializing the elements:

char buf[64];       // contains garbage ..
sprintf(buf,"a");   // .. but no err here
char c = buf[20]; // .. or even here

This is common in C code; since these array accesses are in-bounds, we allow them.”

Example

“However, this technique will not catch even the following simple variation:

char *itoa(int i) {
  char buf[20];
  char *z;
  sprintf(buf,"%d",i);
  z = buf;
  return z;
}

Here, the address of buf is stored in the variable z, and then z is returned. This passes gcc -Wall without complaint.”

Quotes

• “Cyclone is a safe dialect of C. It has been designed from the ground up to prevent the buffer overflows, format string attacks, and memory management errors that are common in C programs, while retaining C’s syntax and semantics. This paper examines safety violations enabled by C’s design, and shows how Cyclone avoids them, without giving up C’s hallmark control over low-level details such as data representation and memory management.”
• “Every introductory C programming course warns against them and teaches techniques to avoid them, yet they continue to be announced in security bulletins every week. There are reasons for this that are more fundamental than poor training:
• One cause of buffer overflows in C is bad pointer arithmetic, and arithmetic is tricky. […]
• C uses NUL-terminated strings. […]
• Out-of-bounds pointers are commonplace in C. […]”
• “Our goal is to design Cyclone so that it has the safety guarantee of Java (no valid program can commit a safety violation) while keeping C’s syntax, types, semantics, and idioms intact.”
• “We must reject some safe programs, because it is impossible to implement an analysis that perfectly separates the safe programs from the unsafe programs.”
• Table 1: “Restrictions imposed by Cyclone to preserve safety”
• NULL checks are inserted to prevent segmentation faults
• Pointer arithmetic is restricted
• Pointers must be initialized before use
• Dangling pointers are prevented through region analysis and limitations on free
• Only “safe” casts and unions are allowed
• goto into scopes is disallowed
• switch labels in different scopes are disallowed
• Pointer-returning functions must execute return
• setjmp and longjmp are not supported
• Table 2: “Extensions provided by Cyclone to safely regain C programming idioms”
• Never-NULL pointers do not require NULL checks
• “Fat” pointers support pointer arithmetic with run-time bounds checking
• Growable regions support a form of safe manual memory management
• Tagged unions support type-varying arguments
• Injections help automate the use of tagged unions for programmers
• Polymorphism replaces some uses of void *
• Varargs are implemented with fat pointers
• Exceptions replace some uses of setjmp and longjmp
• “If you call getc(NULL), what happens? The C standard gives no definitive answer.”
• “Cyclone’s region analysis is intraprocedural — it is not a whole-program analysis.”
• “Here ‘r is a region variable.” (c.f. rust's notation)
• “Obviously, programmers still need a way to reclaim heap-allocated data. We provide two ways. First, the programmer can use an optional garbage collector. This is very helpful in getting existing C programs to port to Cyclone without many changes. However, in many cases it constitutes an unacceptable loss of control.”
• “A goto that does not enter a scope is safe, and is allowed in Cyclone. We apply the same analysis to switch statements, which suffer from a similar vulnerability in C.”
• “The Cyclone compiler is implemented in approximately 35,000 lines of Cyclone. It consists of a parser, a static analysis phase, and a simple translator to C. We use gcc as a back end and have also experimented with using Microsoft Visual C++.”
• “When a user compiles with garbage collection enabled, we use the Boehm-Demers-Weiser conservative garbage collector as an off-the-shelf component.”
• “We achieve near-zero overhead for I/O bound applications such as the web server and the http programs, but there is a considerable overhead for computationally-intensive benchmarks;”
• “Cyclone’s representation of fat pointers turned out to be another important overhead. We represent fat pointers with three words: the base address, the bounds address, and the current pointer location (essentially the same representation used by Mc-Gary’s bounded pointers [26]).”
• “Good code generation can make a big difference: we found that using gcc’s -march=i686 flag increased the speed of programs making heavy use of fat pointers (such as cfrac and grobner) by as much as a factor of two, because it causes gcc to use a more efficient implementation of block copy.”
• “We found array bounds violations in three benchmarks when we ported them from C to Cyclone: mini_httpd, grobner, and tile. This was a surprise, since at least one (grobner) dates back to the mid 1980s.”
• “Cyclone began as an offshoot of the Typed Assembly Language (TAL) project”
• “In C, a switch case by default falls through to the next case, unless there is an explicit break. This is exactly the opposite of what it should be: most cases do not fall through, and, moreover, when a case does fall through, it is probably a bug. Therefore, we added an explicit fallthru statement,” […] “Our decision to “correct” C’s mistake was wrong. It made porting error-prone because we had to examine every switch statement to look for intentional fall throughs, and add a fallthru statement.”
• “There is an enormous body of research on making C safer. Most techniques can be grouped into one of the following strategies:”
1. Static analysis. […]
2. Inserting run-time checks. […]
3. Combining static analysis and run-time checks. […]

Region blocks

“Therefore, Cyclone provides a feature called growable regions. The following code declares a growable region, does some allocation into the region, and deallocates the region:

region h {
  int *x = rmalloc(h,sizeof(int));
  int ?y = rnew(h) { 1, 2, 3 };
  char ?z = rprintf(h,"hello");
}

The code uses a region block to start a new, growable region that lives on the heap. The region is deallocated on exit from the block (without an explicit free).”

Summary

Great paper with a pragmatic approach, derived from the Typed Assembly Language project. Definitely worth a read. Essential for everyone interested in the Rust programming language, as this project inspired many ideas related to proper memory management and lifetimes. The implementation needs to be compiled on your own, but it is not maintained anymore anyway. Furthermore, there are more papers following the progress of the project, and they also introduced more drastic changes, which discard the label “pragmatic”.

Tagged unions

“We solve this in Cyclone in two steps. First, we add tagged unions to the language:”

tunion t {
  Int(int);
  Str(char ?);
};

[…]

void pr(tunion t x) {
  switch (x) {
    case &Int(i): printf("%d",i); break;
    case &Str(s): printf("%s",s); break;
  }
}

“The printf function itself accesses the tagged arguments through a fat pointer (Cyclone’s varargs are bounds checked)”

### Everything Old is New Again: Binary Security of WebAss… §

Title: “Everything Old is New Again: Binary Security of WebAssembly” by Daniel Lehmann, Johannes Kinder, Michael Pradel [url] [dblp]
Published in 2020 at USENIX Security 2020 and I read it in 2020-07
Abstract: WebAssembly is an increasingly popular compilation target designed to run code in browsers and on other platforms safely and securely, by strictly separating code and data, enforcing types, and limiting indirect control ﬂow. Still, vulnerabilities in memory-unsafe source languages can translate to vulnerabilities in WebAssembly binaries. In this paper, we analyze to what extent vulnerabilities are exploitable in WebAssembly binaries, and how this compares to native code. We ﬁnd that many classic vulnerabilities which, due to common mitigations, are no longer exploitable in native binaries, are completely exposed in WebAssembly. Moreover, WebAssembly enables unique attacks, such as overwriting supposedly constant data or manipulating the heap using a stack overﬂow. We present a set of attack primitives that enable an attacker (i) to write arbitrary memory, (ii) to overwrite sensitive data, and (iii) to trigger unexpected behavior by diverting control ﬂow or manipulating the host environment. We provide a set of vulnerable proof-of-concept applications along with complete end-to-end exploits, which cover three WebAssembly platforms. An empirical risk assessment on real-world binaries and SPEC CPU programs compiled to WebAssembly shows that our attack primitives are likely to be feasible in practice. Overall, our ﬁndings show a perhaps surprising lack of binary security in WebAssembly. We discuss potential protection mechanisms to mitigate the resulting risks.

quotes

• “We find that many classic vulnerabilities which, due to common mitigations, are no longer exploitable in native binaries, are completely exposed in WebAssembly. Moreover, WebAssembly enables unique attacks, such as overwriting supposedly constant data or manipulating the heap using a stack overflow.”
• “WebAssembly is an increasingly popular bytecode language that offers a compact and portable representation, fast execution, and a low-level memory model. Announced in 2015 and implemented by all major browsers in 2017, WebAssembly is supported by 92% of all global browser installations as of June 2020. The language is designed as a compilation target, and several widely used compilers exist, e.g., Emscripten for C and C++, or the Rust compiler, both based on LLVM”
• “There are two main aspects to the security of the WebAs-sembly ecosystem: (i) host security, the effectiveness of the runtime environment in protecting the host system against malicious WebAssembly code; and (ii) binary security, the effectiveness of the built-in fault isolation mechanisms in preventing exploitation of otherwise benign WebAssembly code”
• “Comparing the exploitability of WebAssembly binaries with native binaries, e.g., on x86, shows that WebAssembly re-enables several formerly defeated attacks because it lacks modern mitigations. One example are stack-based buffer overflows, which are effective again because WebAssembly binaries do not deploy stack canaries.”
• “The original WebAssembly paper addresses this question briefly by saying that “at worst, a buggy or exploited WebAssembly program can make a mess of the data in its own memory”
• “Regarding data-based attacks, we find that one third of all functions make use of the unmanaged (and unprotected) stack in linear memory. Regarding control-flow attacks, we find that every second function can be reached from indirect calls that take their target directly from linear memory. We also compare WebAssembly’s type-checking of indirect calls with native control-flow integrity defenses.”
• “There are four primitive types: 32 and 64 bit integers (i32 , i64) and single and double precision floats (f32 , f64). More complex types, such as arrays, records, or designated pointers do not exist.”
• “Branches can only jump to the end of surrounding blocks, and only inside the current function. Multi-way branches can only target blocks that are statically designated in a branch table. Unrestricted gotos or jumps to arbitrary addresses are not possible. In particular, one cannot execute data in memory as bytecode instructions.”
• “The call_indirect instruction on the left pops a value from the stack, which it uses to index into the so called table section. Table entries map this index to a function, which is subsequently called. Thus, a function can only be indirectly called if it is present in the table.”
• “In contrast to other byte-code languages, WebAssembly does not provide managed memory or garbage collection. Instead, the so called linear memory is simply a single, global array of bytes.”
• “One of the most basic protection mechanisms in native programs is virtual memory with unmapped pages. A read or write to an unmapped page triggers a page fault and terminates the program, hence an attacker must avoid writing to such addresses. WebAssembly’s linear memory, on the other hand, is a single, contiguous memory space without any holes, so every pointer ∈ [0, max_mem] is valid. […] This is a fundamental limitation of linear memory with severe consequences. Since one cannot install guard pages between static data, the unmanaged stack, and the heap, overflows in one section can silently corrupt data in adjacent sections.”
• “In WebAssembly, linear memory is non-executable by design, as it cannot be jumped to.”
• “an overflow while writing into a local variable on the unmanaged stack, e.g., buffer, may overwrite other local variables in the same and even in other stack frames upwards in the stack”
• “Because in WebAssembly no default allocator is provided by the host environment, compilers include a memory allocator as part of the compiled program”
• “While standard allocators, such as dlmalloc, have been hardened against a variety of metadata corruption attacks, simplified and lightweight allocators are often vulnerable to classic attacks. We find both emmalloc and wee_alloc to be vulnerable to metadata corruption attacks, which we illustrate for a version of emmalloc in the following.”
• “As WebAssembly has no way of making data immutable in linear memory, an arbitrary write primitive can change the value of any non-scalar constant in the program, including, e.g., all string literals.”
• “Version 1.6.35 of libpng suffers from a known buffer overflow vulnerability (CVE-2018-14550 [3]), which can be exploited when converting a PNM file to a PNG file. When the library is compiled to native code with modern compilers on standard settings, stack canaries prevent this vulnerability from being exploited. In WebAssembly, the vulnerability can be exploited unhindered by any mitigations.”
• “While exec and the log_* functions have different C++ types, all three functions have identical types on the WebAssembly level (Figure 8b). The reason is that both integers and pointers are represented as i32 types in WebAssembly, i.e., the redirected call passes WebAssembly’s type check.”
• “To the best of our knowledge, it is the first security analysis tool for WebAssembly binaries. The analysis is written in Rust”
• “For example, with ten nested calls (assuming a uniform distribution of functions), there would be some data on the unmanaged stack with 1 − (1 − 0.33)^10 ≈ 98.2% probability.”
• “Averaged over all 26 programs, 9.8% of all call instructions are indirect,”
• “[…] how many functions are type-compatible with at least one call_indirect instruction and present in the table section. The percentage of indirectly callable functions ranges from 5% to 77.3%, with on average 49.2% of all functions in the program corpus.”
• “WebAssembly’s type checking of indirect calls can be seen as a form of control-flow integrity (CFI) for forward edges. Backward edges, i.e., returns, are protected due to being managed by the VM and offer security conceptually similar to shadow stack solutions.”
• multiple memories proposal: Andreas Rossberg. “Multiple per-module memories for Wasm”. https://github.com/WebAssembly/multi-memory, 2019.
• reference types proposal: Andreas Rossberg. “Proposal for adding basic reference types”. https://github.com/WebAssembly/reference-types, 2019.
• “Examples that would benefit Web-Assembly compilers are FORTIFY_SOURCE-like code rewriting, stack canaries, CFI defenses, and safe unlinking in memory allocators. In particular for stack canaries and rewriting commonly exploited C string functions, we believe there are no principled hindrances to deployment.”
• “Developers of WebAssembly applications can reduce the risk by using as little code in “unsafe” languages, such as C, as possible.”
• “The language has a formally defined type system shown to be sound”
• via [37] = “Not So Fast: Analyzing the Performance of WebAssembly vs. Native Code”: “We find that the mean slowdown of WebAssembly vs. native across SPEC bench-marks is 1.55×for Chrome and 1.45×for Firefox, with peak slowdowns of 2.5×in Chrome and 2.08×in Firefox”

summary

Wonderful paper. Firstly, a security analysis was necessary and I think they covered an appropriate amount of attack vectors. Secondly it is, of course, sad to see that many old vulnerabilities still work on a new platform.

I am unconditionally in favor of true read-only memory in WebAssembly. As David points out, this can also be enforced by a compiler through appropriate runtime checks. However, David also specified a criterion: iff a memory attack could lead to an escape from the WASM sandbox, then it is an issue of the WebAssembly sandbox and should be prevented at this level.

I keep wondering about stack canaries and guard pages. Maybe it was a trade-off decision between security and performance, but I am not 100% convinced by it. Thirdly, the paper is well-structured and gives sufficient data to support its arguments. The attacks were okay, the introduction to WebAssembly was superb, and justifying the claims regarding indirect calls with quantitative data in section 6 was outstanding. I think everyone in IT security can easily follow it. I love it!

WebAssembly architecture:

• not possible
  • overwriting string literals in supposedly constant memory is not possible
  • In WebAssembly, linear memory is non-executable by design, as it cannot be jumped to
  • WebAssembly’s linear memory is a single, contiguous memory space without any holes, so every pointer ∈ [0, max_mem] is valid.
• measures, restrictions
  • program memory, data structures of the underlying VM, and the stack are separated
  • binaries are easily type-checked
  • only jumps to designated code locations are possible
  • WebAssembly has two mechanisms that limit an attacker’s ability to redirect indirect calls. First, not all functions defined in or exported into a WebAssembly binary appear in the table for indirect calls, but only those that may be subject to an indirect call. Second, all calls, both direct and indirect, are type checked.
• missing
  • In WebAssembly, there is no ASLR.
  • stack canaries are not used
  • buffer and stack overflows are thus very powerful attack primitives in WebAssembly
  • an overflow while writing into a local variable on the unmanaged stack, e.g., buffer, may overwrite other local variables in the same and even in other stack frames upwards in the stack
  • As WebAssembly has no way of making data immutable in linear memory, an arbitrary write primitive can change the value of any non-scalar constant in the program, including, e.g., all string literals.

toc

• Introduction
• Background on WebAssembly
• Security Analysis of Linear Memory
  • Managed vs. Unmanaged Data
  • Memory Layout
  • Memory Protections
• Attack Primitives
  • Obtaining a Write Primitive
    • Stack-based Buffer Overflow
    • Stack Overflow
    • Heap Metadata Corruption
  • Overwriting Data
    • Overwriting Stack Data
    • Overwriting Heap Data
    • Overwriting “Constant” Data
  • Triggering Unexpected Behavior
    • Redirecting Indirect Calls
    • Code Injection into Host Environment
    • Application-specific Data Overwrite
• End-to-End Attacks
  • Cross-Site Scripting in Browsers
  • Remote Code Execution in Node.js
  • Arbitrary File Write in Stand-alone VM
• Quantitative Evaluation
  • Experimental Setup and Analysis Process
  • Measuring Unmanaged Stack Usage
  • Measuring Indirect Calls and Targets
  • Comparing with Existing CFI Policies
• Discussion of Mitigations
  • WebAssembly Language
  • Compilers and Tooling
  • Application and Library Developers
• Related Work
• Conclusion

### FastSpec: Scalable Generation and Detection of Spectre… §

Title: “FastSpec: Scalable Generation and Detection of Spectre Gadgets Using Neural Embeddings” by M. Caner Tol, Koray Yurtseven, Berk Gulmezoglu, Berk Sunar [url] [dblp]
Published in 2020 and I read it in 2020-07
Abstract: Several techniques have been proposed to detect vulnerable Spectre gadgets in widely deployed commercial software. Unfortunately, detection techniques proposed so far rely on hand-written rules which fall short in covering subtle variations of known Spectre gadgets as well as demand a huge amount of time to analyze each conditional branch in software. Since it requires arduous effort to craft new gadgets manually, the evaluations of detection mechanisms are based only on a handful of these gadgets.

ambiguity

• DNN (page 3) is undefined
• Equation 1: Notation X~Pdata is unknown to me
• “which alters the flags”. What is a flag? (Step 4, page 5)

quotes

• “Several techniques have been proposed to detect vulnerable Spectre gadgets in widely deployed commercial software. Unfortunately, detection techniques proposed so far rely on hand-written rules which fall short in covering subtle variations of known Spectre gadgets as well as demand a huge amount of time to analyze each conditional branch in software.”
• “In this work, we employ deep learning techniques for automated generation and detection of Spectre gadgets.”
• “Using mutational fuzzing, we produce a data set with more than 1 million Spectre-V1 gadgets which is the largest Spectre gadget data set built to date.”
• “we conduct the first empirical usability study of Generative Adversarial Networks (GANs) for creating assembly code without any human interaction.”
• “While the initial variants of Spectre [26] exploit conditional and indirect branches, Koruyeh et al. [27] proposes another Spectre variant obtained by poisoning the entries in Return-Stack-Buffers (RSBs).”
• “The proposed detection tools mostly implement taint analysis [66] and symbolic execution [15,65] to identify potential gadgets in benign applications. However, the methods proposed so far have two shortcomings: (1) the scarcity of Spectre gadgets prevents the comprehensive evaluation of the tools, (2) the scanning time increases drastically with increasing binary file sizes.”
• “BERT [9] was proposed by the Google AI team to learn the relations between different words in a sentence by applying a self-attention mechanism [63].”
• “In summary,
• We extend 15 base Spectre examples to 1 million gadgets via mutational fuzzing,
• We propose SpectreGAN which leverages conditional GANs to create new Spectre gadgets by learning the distribution of existing Spectre gadgets in a scalable way,
• We show that both mutational fuzzing and SpectreGAN create diverse and novel gadgets which are not detected by oo7 and Spectector tools
• We introduce FastSpec which is based on supervised neural word embeddings to identify the potential gadgets in benign applications orders of magnitude faster than rule-based methods.”
• “Hence, Spectre-type attacks are not completely resolved yet and finding an efficient countermeasure is still an open problem.”
• “There are a number of known Spectre variants: Spectre-V1 (bounds check bypass), Spectre-V2 (branch target injection), Spectre-RSB [27, 34] (return stack buffer speculation), and Spectre-V4 [19] (speculative store bypass).”
• “The heavy part of the training is handled by processing unlabeled data in an unsupervised manner. The unsupervised phase is called pre-training which consists of masked language model training and next sentence prediction procedures.”
• “Defenses against Spectre. There are various detection methods for speculative execution attacks. Taint analysis is used in oo7 [66] software tool to detect leakages. As an alternative way, the taint analysis is implemented in the hardware context to stop the speculative execution for secret dependent data [53, 71]. The second method relies on symbolic execution analysis. Spectector [15] symbolically executes the programs where the conditional branches are treated as mispredicted. Furthermore, SpecuSym [18] and KleeSpectre [65] aim to model cache usage with symbolic execution to detect speculative interference which is based on Klee symbolic execution engine. Following a different approach, Speculator [35] collects performance counter values to detect mispredicted branches and speculative execution domain. Finally, Specfuzz [44] uses a fuzzing strategy to analyze the control flow paths which are most likely vulnerable against speculative execution attacks.”
• “Our mutation operator is the insertion of random Assembly instructions with random operands.”
• “The overall success rate of fuzzing technique is around 5% out of compiled gadgets.”
• “the input assembly functions are converted to a sequence of tokens T' = {x'1, … x'N} where each token represents an instruction, register, parenthesis, comma, intermediate value or label. SpectreGAN is conditionally trained with each sequence of tokens where a masking vector m = (m1, …, mN) with elements mt ∈ {0, 1} is generated.”
• “The training procedure consists of two main phases namely, pre-training and adversarial training.”
• “We keep commas, parenthesis, immediate values, labels, instruction and register names as separate tokens.”
• “The tokenization process converts the instruction "movq (%rax), %rdx" into the list ["movq", "(", "%rax", ")", ",", "%rdx"] where each element of the list is a token x't. Hence, each token list T' = {x'1, …, x'N} represents an assembly function in the data set.”
• “SpectreGAN is trained with a batch size of 100 on NVIDIA GeForce GTX 1080 Ti until the validation perplexity converges in Figure 2. The pre-training lasts about 50 hours while the adversarial training phase takes around 30 hours.” (80 hours = 3d 8h)
• “SpectreGAN is trained with a masking rate of 0.3, the success rate of gadgets increases up to 72%. Interestingly, the success rate drops for other masking rates, which also demonstrates the importance of masking rate choice.”
• “To illustrate an example of the generated samples, we fed the gadget in Listing 2 to SpectreGAN and generated a new gadget in Listing 3.”
• “Mutational fuzzing and SpectreGAN generated approximately 1.2 million gadgets in total.”
• “The quality of generated texts is mostly evaluated by analyzing the number of unique n-grams.”
• “However, it is challenging to examine the effects of instructions in the transient domain since they are not visible in the architectural state. After we carefully analyzed the performance counters for the Haswell architecture, we determined that two counters namely, UOPS_ISSUED:ANY and UOPS_RETIRED:ANY give an idea to what extent the speculative window is altered. UOPS_ISSUED:ANY counter is incremented every time a μop is issued which counts both speculative and non-speculative μops. On the other hand, UOPS_RETIRED:ANY counter only counts the executed and committed μops which automatically excludes speculatively executed μops.”
• “we have selected 100,000 samples from each gadget example uniformly random due to the immense time consumption of oo7 (150 hours for 100K gadgets) which achieves 94% detection rate.”
• “Interestingly, specific gadget types from both fuzzing and SpectreGAN are not caught by oo7. When a gadget contains cmov or xchg or set instruction and its variants, it is not identified as a Spectre gadget.”
• “Listing 5: XCHG gadget: When a past value controlled by the attacker is used in the Spectre gadget, oo7 cannot detect the XCHG gadget”
• “For each Assembly file, Spectector is adjusted to track 25 symbolic paths of at most 5000 instructions each, with a global timeout of 30 minutes. The remaining parameters are kept as default.”
• “23.75% of the gadgets are not detected by Spectector. We observed that 96% of the undetected gadgets contain unsupported instruction/register which is the indicator of an implementation issue in Spectector.”
• “After we examined the undetected gadgets, we observed that if the gadgets include either sfence/mfence/lfence or 8-bit registers (%al, %bl, %cl, %dl), they are likely to bypass Spectector.”
• “Differently, the mask positions are selected from 15% of the training sequence and the selected positions are masked and replaced with <MASK> token with 0.80 probability, replaced with a random token with 0.10 probability or kept as the same token with 0.10 probability.”
• “Since it is not possible to visualize the high dimensional embedding vectors, we leverage the t-SNE algorithm [33] which maps the embedding vectors to a three-dimensional space as shown in Figure 4.”
• “The output probabilities of the softmax layer are the predictions on the assembly code sequence.”
• “We combine the assembly data set that was generated in Section 4 and the disassembled Linux libraries to train FastSpec. Although it is possible that Linux libraries contain Spectre-V1 gadgets, we assume that the total number of hidden Spectre gadgets are negligible comparing the total size of the data set.”
• “In total, a dataset of 107 million lines of assembly code is collected which consists of 370 million tokens after the pre-processing.”
• “The pre-training phase takes approximately 6 hours with a sequence length of 50. We further train the positional embeddings for 1 hour with a sequence length of 250. The fine-tuning takes only 20 minutes on the pre-trained model to achieve classifying all types of samples in the test data set correctly.”
• “In the evaluation of FastSpec, we obtained 1.3 million true positives and 110 false positives (99.9% precision rate) in the test data set which demonstrates the high performance of FastSpec. We assume that the false positives are Spectre-like gadgets in Linux libraries, which needs to be explored deeply in the future work. Moreover, we only have 55 false negatives (99.9% recall rate) which yields 0.99 F-1 score on the test data set.”
• “The processing time of FastSpec is independent of the number of branches whereas for Spectector and oo7 the analysis time increases drastically.”
• “Consequently, FastSpec is faster than oo7 and Spectector 455 times and 75 times on average, respectively.”
• “The total number of tokens is 203,055 while the analysis time is around 17 minutes.”
• “This work for the first time proposed NLP inspired approaches for Spectre gadget generation and detection.”
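The tokenization described in the quotes above can be sketched with a small regex. This is my own illustrative sketch, not FastSpec's code; the regex and function name are assumptions that merely reproduce the paper's example.

```python
import re

# Hypothetical tokenizer for AT&T-syntax assembly: keeps instruction
# names, registers, immediates, parentheses, commas and labels as
# separate tokens, as described in the quoted tokenization step.
TOKEN_RE = re.compile(r"%\w+|\$-?\w+|[(),]|[.\w]+:?")

def tokenize(line):
    return TOKEN_RE.findall(line)

# tokenize("movq (%rax), %rdx") yields
# ["movq", "(", "%rax", ")", ",", "%rdx"]
```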

summary

Very advanced paper. Perfect combination of Machine Learning technology with microarchitectural attack work. Shows a huge effort and nice considerations regarding Generative Adversarial Networks. However, I could not understand all technical aspects of the machine learning part.

• Several techniques have been proposed to detect vulnerable Spectre gadgets in widely deployed commercial software. Unfortunately, detection techniques proposed so far rely on hand-written rules. Current shortcomings are: (1) the scarcity of Spectre gadgets prevents the comprehensive evaluation of the tools, (2) the scanning time increases drastically with increasing binary file sizes
• Approach:
1. Mutational fuzzing is used to expand Kocher's 15 + Spectector's 2 Spectre gadgets to more than 1 million
2. A Generative Adversarial Network (SpectreGAN; based on MaskGAN) is used to generate Assembly
3. FastSpec (based on BERT by Google) takes Assembly and determines whether some binary contains a Spectre gadget
• They achieve 455× the performance of oo7 and 75× the performance of Spectector
• On the combined data set (generated gadgets plus disassembled Linux libraries, 107 million lines of ASM code in total), FastSpec obtained 1.3 million true positives and 110 false positives
• 379 matches were found in the OpenSSL 1.1.1g library
• On GitHub, there is a 390 MB tar.gz archive (split up). Decompressed, it has a size of 6.7 GB. 972 MB seem to be 239 025 Spectre test gadget files in ASM format

The masking rate (page 6) seems to be the percentage of hidden tokens during the training phase. Figure 4 is a little bit awkward and a little bit random. FastSpec/scan.sh seems to show how FastSpec was called to evaluate OpenSSL. And commands.txt tries to explain it somehow.
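The 15% / 80-10-10 masking scheme quoted earlier can be sketched as follows. This is a generic BERT-style sketch under my own naming assumptions, not the paper's training code:

```python
import random

def mask_tokens(tokens, vocab, rate=0.15, rng=random):
    """Select `rate` of the positions; replace each selected token with
    <MASK> (p=0.8), a random vocabulary token (p=0.1), or keep it
    unchanged (p=0.1) -- the scheme described in the quotes above."""
    out = list(tokens)
    for i in range(len(out)):
        if rng.random() < rate:
            r = rng.random()
            if r < 0.8:
                out[i] = "<MASK>"
            elif r < 0.9:
                out[i] = rng.choice(vocab)
            # else: keep the original token
    return out
```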

typo

• “The critical time period before the flush happens is commonly referred to the transient domain.” → “The critical time period before the flush happens is commonly referred to as the transient domain.”
• “microarchtiectures” → “microarchitectures”
• “An resule of a sample gadget” → “A result of a sample gadget”

### High-speed Instruction-set Coprocessor for Lattice-bas… §

Title: “High-speed Instruction-set Coprocessor for Lattice-based Key Encapsulation Mechanism: Saber in Hardware” by Sujoy Sinha Roy, Andrea Basso [url] [dblp]
Published in 2020 at CHES 2020 and I read it in 2020-11
Abstract: In this paper, we present an instruction set coprocessor architecture for lattice-based cryptography and implement the module lattice-based post-quantum key encapsulation mechanism (KEM) Saber as a case study. To achieve fast computation time, the architecture is fully implemented in hardware, including CCA transformations. Since polynomial multiplication plays a performance-critical role in the module and ideal lattice-based public-key cryptography, a parallel polynomial multiplier architecture is proposed that overcomes memory access bottlenecks and results in a highly parallel yet simple and easy-to-scale design. Such multipliers can compute a full multiplication in 256 cycles, but are designed to target any area/performance trade-offs. Besides optimizing polynomial multiplication, we make important design decisions and perform architectural optimizations to reduce the overall cycle counts as well as improve resource utilization.

open questions

• Figure 3: Why are control signals leading to the building blocks like AddPack? Wouldn't it be simpler (for synchronization) to make control signals part of the communication over the bus?
• “The overhead of memory access during polynomial multiplication plays a critical role in lattice-based cryptography (e.g., [RVM+14], [BMTK+20]) and could hinder or complicate logic-level parallel processing.”
• What kind of role?

quotes

• “In 2012, Göttert et al. [GFS + 12] reported the first hardware implementation of the ideal lattice-based LPR [LPR10] public-key encryption scheme. Their implementation used a massively parallel and unrolled NTT-based polynomial multiplier architecture that consumed millions of LUTs and flip-flops.”
• “A comparison of most round 2 submissions, including Saber, can be found in [DFA + 20].”
• “In [DKSRV18] the authors of Saber proposed a fast polynomial multiplier based on the Toom-Cook algorithm [Knu97] and showed that a non-NTT parameter set does not make their implementation slow.”
• “In practice, more than 50% of the computation time is spent on generating pseudo-random numbers using SHAKE128, thus making it the performance bottleneck.”
• “Dang et al. [DFAG19] compare seven lattice-based key encapsulation methods on HW/SW codesign platforms. They report that out of the seven tested protocols (FrodoKEM, Round5, Saber, NTRU-HPS, NTRU-HRSS, Streamlined NTRU Prime and NTRULPRime), Saber is the fastest protocol in the encapsulation operation and second fastest in the decapsulation operation.”
• “At the same time, implementing such an accelerator is a challenging research topic because it requires making careful design decisions that take into account both algorithmic and architectural alternatives for the internal building blocks and their interactions at the protocol level.”
• “When a HW-only implementation is considered, one design option is to cascade different building blocks in the data-path following the standard data-flow model, if the blocks are required in multiple parallel instances.”
• “To achieve programmability and flexibility, we realize an instruction-set coprocessor architecture for Saber.”
• “In this work, we use the open-source high-speed implementation of the Keccak core that was designed by the Keccak Team [Tea19]. This high-speed implementation of Keccak computes ‘state-permutations’ at a gap of only 28 cycles, thus generating 1,344 bits of pseudo-random string every 28 cycles during the extraction-phase. Furthermore, we observed that one instance of the Keccak core consumes around 5K LUTs and 3K registers, which are respectively nearly 21% and 31% of the overall area in our implementation.”
• via SaberX4 paper: “Our proof-of-concept software implementation of SaberX4 achieves nearly 1.5 times higher throughput at the cost of latency degradation within acceptable margins, compared to the AVX2-optimized non-batched implementation of Saber by its authors.”
• “The use of ‘4-bit signed-magnitude’ representation simplifies the hardware architecture because we can store 16 such samples easily in a 64-bit word of the data memory. Thus, no sample is split across two words.”
• Consider that [-3, 3] requires 3 bits (7 states), [-4, 4] requires 4 bits (9 states) and [-5, 5] requires 4 bits (11 states). With 4 bits we cover all cases and no sample is split across word boundaries. With 3 bits, the samples would be partitioned across bytes as [3 3 2][1 3 3 1][2 3 3]
• “asymptotically the second fastest after the NTT-based polynomial multiplication.”
• Remove “the”, since Schönhage-Strassen and Fürer are two NTT-based multiplication algorithms
• “The hardware implementation of the Toom-Cook polynomial multiplication by Bermudo Mera et al. [BMTK + 20] describes the challenges in implementing the recursive function calls in hardware and proposes efficient architectures.”
• “This nega-cyclic rotation happens since the reduction-polynomial is x^256 + 1.”
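The nega-cyclic wrap-around can be illustrated with a schoolbook multiplier. This is a plain software sketch of the reduction, not the paper's parallel MAC architecture: any product term reaching degree n wraps back with a sign flip because x^n ≡ −1 (mod x^n + 1).

```python
def negacyclic_mul(a, b, q):
    """Schoolbook multiplication of polynomials a, b (coefficient lists
    of length n) modulo x^n + 1 and modulo q: product terms of degree
    >= n wrap around with a sign flip since x^n = -1."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] = (c[i + j] + a[i] * b[j]) % q
            else:
                c[i + j - n] = (c[i + j - n] - a[i] * b[j]) % q
    return c
```

For example, with n = 4, multiplying x by x^3 gives x^4, which reduces to −1.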

summary

A good read with appropriate comparisons with other schemes and implementations. Reasonable arguments are provided for design decisions. The runtime results were expectable and have been met.

• “In this paper, we present an instruction set coprocessor architecture for lattice-based cryptography and implement the module lattice-based post-quantum key encapsulation mechanism (KEM) Saber as a case study.”
• “Since polynomial multiplication plays a performance-critical role in the module and ideal lattice-based public-key cryptography, a parallel polynomial multiplier architecture is proposed that overcomes memory access bottlenecks and results in a highly parallel yet simple and easy-to-scale design.”
• “For the module dimension 3 (security comparable to AES-192), the coprocessor computes CCA key generation, encapsulation, and decapsulation in only 5,453, 6,618 and 8,034 cycles respectively, making it the fastest hardware implementation of Saber to our knowledge.”
• “module dimension 3” corresponds to “NIST PQC category 3”
• “The Vivado project and all HDL source codes are available at https://github.com/sujoyetc/SABER_HW.”
• In Figure 3, the pseudo-random word from data memory is split into chunks of size μ. μ bits are turned into a binomial sample of size 4 bits.
• “There is no conditional branching in the algorithms used and all the building blocks have been designed to be constant-time.”
• “Software benchmarking [KRSS19] of many lattice-based KEM schemes have reported that 50-70% of the overall computation time is spent on executing the Keccak function, thus making it the most performance-critical component.”
• The Saber reference implementation uses code similar to Kyber for binomial sampling
• In Figure 7, the sign bit only depends on one argument, because s(x) is always positive.
• “The instruction-set coprocessor architecture is described in mixed Verilog and VHDL and is compiled using Xilinx Vivado for the target platform Xilinx ZCU102 board that has an UltraScale+ XCZU9EG-2FFVB1156 FPGA.”
• “We tested the functional correctness of the coprocessor on the ZCU102 board and at 250 MHz clock frequency, the CCA-secure key generation, encapsulation and decapsulation operations take 21.8, 26.5, and 32.1 μs respectively.”
• “As the polynomial multiplier architecture is scalable, we implemented a variant of it with MAC units fitting two multipliers. With this higher-performing architecture, the cycle counts for polynomial multiplications nearly halves, …”
• “The overall cycle count for Saber (module dimension 3) is 4,320, 5,231 and 6,461 for key generation, encapsulation, and decapsulation respectively. Thus, the cycle count is reduced by 21%, 21%, and 20% respectively. The increased speed comes with increased area consumption of 1.83× for LUTs and 1.74× for flip-flops (this is both due to the increased area consumption of the MAC units with two multipliers and due to the pipelining).”
• “In Table 5 we compare our flexible architecture with some of the recent hardware implementations of post-quantum KEM schemes.”
• “The SIKE [JF11] scheme relies on the computational hardness of the supersingular isogeny problem. Its most recent hardware implementation by Massolino et al. [MLRB20] targets high speed and even beats Frodo KEM. Our hardware implementation of Saber is around 500 to 600 times faster than their implementation.”
• Cycle count breakdown (table from the paper):

```
Instruction                  Keygen   Encapsulation   Decapsulation
-------------------------------------------------------------------
SHA3-256                        339             585             303
SHA3-512                          0              62              62
SHAKE-128                     1,461           1,403           1,403
Vector sampling                 176             176             176
Polynomial multiplications    2,685           3,592           4,484
  (share of total)              47%             54%             56%
Remaining operations            792             800           1,606
Total cycles                  5,453           6,618           8,034
Total time at 250 MHz       21.8 μs         26.5 μs         32.1 μs
```
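The binomial sampling mentioned above (μ pseudo-random bits turned into one sample, cf. Figure 3) can be sketched as a centered binomial distribution. A generic software sketch, not the paper's hardware sampler; the function name is my own:

```python
def cbd_sample(bits):
    """Turn mu pseudo-random bits into one centered binomial sample in
    [-mu/2, mu/2]: Hamming weight of the first half minus that of the
    second half. For Saber's small mu, the result fits the 4-bit
    signed-magnitude word discussed in the quotes above."""
    half = len(bits) // 2
    return sum(bits[:half]) - sum(bits[half:2 * half])
```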

typo

• “This means that each MAC unit should receive in input”
“This means that each MAC unit should receive an input”
• “Nevertheless, our coprocessor has been tested in the hardware”
“Nevertheless, our coprocessor has been tested in hardware”

### Historical Notes on the Fast Fourier Transform §

Title: “Historical Notes on the Fast Fourier Transform” by James W. Cooley, Peter A. W. Lewis, Peter D. Welch [url] [dblp]
Published in 1967 at IEEE 1967 and I read it in 2020-03
Abstract: The fast Fourier transform algorithm has a long and interesting history that has only recently been appreciated. In this paper, the contributions of many investigators are described and placed in historical perspective.

summary

• Survey paper on the historical developments of the four years before. Nice and straightforward summary, though I didn't go through the details of the prime factor algorithm.
• “The greatest emphasis, however, was on the computational economy that could be derived from the symmetries of the sine and cosine functions”
• “use the periodicity of the sine-cosine functions to obtain a 2N-point Fourier analysis from two N-point analyses with only slightly more than N operations. Going the other way, if the series to be transformed is of length N and N is a power of 2, the series can be split into log₂ N subseries”
• “The number of computations in the resulting successive doubling algorithm is therefore proportional to N log₂ N rather than N²”
• “The fast Fourier transform algorithm of Cooley and Tukey is more general in that it is applicable when N is composite and not necessarily a power of 2”
• “the algorithms are different for the following reasons: 1) in the Thomas algorithm the factors of N must be mutually prime; 2) in the Thomas algorithm the calculation is precisely multidimensional Fourier analysis with no intervening phase shifts or “twiddle factors” as they have been called; and 3) the correspondences between the one-dimensional index and the multidimensional indexes in the two algorithms are quite different.”
• “The factor W_N^{j_0 n_0}, referred to as the “twiddle factor” by Gentleman and Sande, is usually combined with either the W_r^{j_0 n_1} factor in (eq 4) or the W_s^{j_1 n_0} factor in (eq 5)”
• eq 4: A_1(j_0, n_0) = \sum_{n_1=0}^{r-1} A(n_1, n_0) W_r^{j_0 n_1}
• eq 5: X(j_1, j_0) = \sum_{n_0=0}^{s-1} A_1(j_0, n_0) W_s^{j_1 n_0} W_N^{j_0 n_0}
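The successive-doubling idea quoted above (two N-point analyses combined into one 2N-point analysis via twiddle factors, giving N log₂ N operations) can be sketched as a recursive radix-2 FFT. A textbook sketch, not code from the paper:

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of 2."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])  # N/2-point transform of even-indexed samples
    odd = fft(x[1::2])   # N/2-point transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + t            # butterfly: combine the two halves
        out[k + n // 2] = even[k] - t
    return out
```

Each level of recursion does O(N) work and there are log₂ N levels, which is exactly the N log₂ N count quoted above.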

### Number "Not" Used Once - Key Recovery Fault Attacks on… §

Title: “Number "Not" Used Once - Key Recovery Fault Attacks on LWE Based Lattice Cryptographic Schemes” by Prasanna Ravi, Shivam Bhasin, Anupam Chattopadhyay [url] [dblp]
Published in 2018 at COSADE 2019 and I read it in 2020-05
Abstract: This paper proposes a simple single bit flip fault attack applicable to several LWE (Learning With Errors Problem) based lattice based schemes like KYBER, NEWHOPE, DILITHIUM and FRODO which were submitted as proposals for the NIST call for standardization of post quantum cryptography. We have identified a vulnerability in the usage of nonce, during generation of secret and error components in the key generation procedure. Our fault attack, based on a practical bit flip model (single bit flip to very few bit flips for proposed parameter instantiations) enables us to retrieve the secret key from the public key in a trivial manner. We fault the nonce in order to maliciously use the same nonce to generate both the secret and error components which turns the LWE instance into an exactly defined set of linear equations from which the secret can be trivially solved for using Gaussian elimination.

ambiguity

What is z in the provided interval in section 2.6? Seems to be either an arbitrary integer or z of Algorithm 1 (NewHope)

error

“later converted to the NTT domain using the PolyBitRev function” … no, it is a separate step

inconsistency

• t = As1 + s2 in flow text in section 2.5
• t = A × s1 + s2 in Algorithm 4

quotes

• “Nonces are predominantly used in all the aforementioned schemes in order to reduce the amount of randomness required to generate the secret and error components used in the generation of the LWE instance during key generation.”
• “main focus on the key generation algorithm, as that is the target of our attack.”
• “We also use x ← Sη to denote the module x whose coefficients lie in the range [−η, η].”
• About Kyber: “Apart from this, there are other modifications to the scheme like the modified technique for generation of the public module A as in [4], compressed public key and ciphertexts through "Bit-dropping" using the Learning-with-rounding (LWR) problem”
• “every coefficient of the recovered t is only a perturbed version of the original t generated during the key-generation procedure KYBER.CPAPKE.GEN(). Thus, the public key can be assumed to be built based on the hardness of both the MLWE and MLWR problem.”
• About Frodo: “The modulus q is chosen to be a power of 2 that enables easy reduction through bit masking.”
• “For example, if the error component is an all zero vector, then the LWE instance is converted into a set of linear equations with equal number of equations and unknowns. This instance can be solved using straight forward Gaussian elimination. If the error component only has values in a fixed interval [z + 1/2, z - 1/2], then one can just "round away" the non-integral part and subtract z to remove the error from every sample [29]. There are also other easy instances of LWE, For eg. From a given set of n LWE instances, if k of the n error components add up to zero, then one can simply add the corresponding samples to cancel the error and obtain an error free sample. It is also possible to solve an LWE instance in roughly nd time and space using nd samples if the error in the samples lie in a known set of size d [5]. For a very small d, this yields a very practical attack.”
• “Thus, if an attacker can perform a single bit flip of the nonce such that both the calls to the function poly_sample use the same seed, then it yields s = e.”
• “Thus, ultimately the attacker has to inject a minimum of just two bit flips in order to create a very easy LWE instance that results in direct retrieval of the secret key S through Gaussian elimination.”
• “In KYBER, the dimensions of both s1 and s2 are the same and equal k. But, the generated MLWE instance t = a × s1 + s2 is further protected by the hardness of the MLWR problem through the Compressq function and hence the public key is formed by a combination of MLWE and MLWR instances.”
• “Since the nonce is deterministically generated, one can use an error correction scheme to check if the correct value of nonce is being used for the generation of polynomials.”
• “The motivation to use a nonce for generation of polynomials is mainly to reduce the randomness requirement and use the same seed to generate multiple polynomials.”

summary

• paper from 2019
• public key in Learning With Errors problem is given by A×s + e
• If s = e, then A×s + s gives a trivial linear equation system to solve to get the secret key s
• if the nonce is the same between the generation of s and e, then s = e
• stuck-at fault attack on nonces in NewHope, Kyber, Dilithium, and Frodo schemes
• trivial for NewHope (keep poly_sample(&shat, noiseseed, 0) the same, apply stuck-at(0) to poly_sample(&ehat, noiseseed, 1))
• Technically more difficult for Frodo, because stuck-at(1) must be applied to Frodo.SampleMatrix(seed_E, n, n_bar, T_chi, 2). Requires two stuck-at bits unlike NewHope
• For Dilithium, l stuck-ats are required, but then the equation system is only solvable if enough equations (i.e. signatures) are available
• For Kyber, k stuck-ats are required, but the equation system is not necessarily solvable, since rounding still protects it

Short paper, tiny errors, no practical implementation (how successfully can the attack on Frodo be implemented? Correction: unlike the paper, their presentation slides show evaluation data). The structure of the text is beautiful and it is well-written. I particularly enjoyed the summary of weak LWE instances. Masking is another countermeasure, but it is heavy-weight compared to the proposed error-correcting codes.

typo

The aforementioned fault vulnerabilities associated with the nonce only occur due to the implementation strategies used and hence can we believe can be easily corrected albeit with a small performance overhead.

⇒ hence, we believe, can be easily corrected

### On the criteria to be used in decomposing systems into… §

Title: “On the criteria to be used in decomposing systems into modules” by David Lorge Parnas [url] [dblp]
Published in 1972 at CACM, Volume 15, 1972 and I read it in 2021-03
Abstract: This paper discusses modularization as a mechanism for improving the flexibility and comprehensibility of a system while allowing the shortening of its development time. The effectiveness of a “modularization” is dependent upon the criteria used in dividing the system into modules. Two system design problems are presented, and for each, both a conventional and unconventional decomposition are described. It is shown that the unconventional decompositions have distinct advantages for the goals outlined. The criteria used in arriving at the decompositions are discussed. The unconventional decomposition, if implemented with the conventional assumption that a module consists of one or more subroutines, will be less efficient in most cases. An alternative approach to implementation which does not have this effect is sketched.

quotes

• “A well-defined segmentation of the project effort ensures system modularity”
• “The major advancement in the area of modular programming has been the development of coding techniques and assemblers which (1) allow one module to be written with little knowledge of the code in another module, and (2) allow modules to be reassembled and replaced without reassembly of the whole system”
• “Below are several partial system descriptions called modularizations. In this context ‘module’ is considered to be a responsibility assignment rather than a subprogram. The modularizations include the design decisions which must be made before the work on independent modules can begin.”
• “The system is divided into a number of modules with well-defined interfaces; each one is small enough and simple enough to be thoroughly understood and well programmed.”
• “In the first modularization the interfaces between the modules are the fairly complex formats and table organizations described above.”
• “In the second modularization the interfaces are more abstract; they consist primarily in the function names and the numbers and types of the parameters”
• “In the first decomposition the criterion used was to make each major step in the processing a module”
• “The second decomposition was made using ‘information hiding’ as a criterion”
• “In discussions of system structure it is easy to confuse the benefits of a good decomposition with those of a hierarchical structure”
• “We have tried to demonstrate by these examples that it is almost always incorrect to begin the decomposition of a system into modules on the basis of a flowchart. We propose instead that one begins with a list of difficult design decisions or design decisions which are likely to change. Each module is then designed to hide such a decision from the others”

summary

First, someone needs to get familiar with KWIC (recognize the paper reference in the section below). KWIC felt like an arbitrary index someone came up with. I got confused by phrases like “the characters are packed four to a word”, which make little sense outside the index context of a book. But after reading the paper, I looked up the Wikipedia article and learned about its use case (an index of keywords before full-text search was available, or in print). The paper is considered an ACM Classic and Parnas received high praise for it.

Essentially, first I had to understand the setting when this paper was written. I grew up with data encapsulation in object-oriented programming, local scoping in programming languages and manipulating data behind a pointer was already a foreign, dangerous idea. The setting is the transition of assembler language towards more high-level languages with questions regarding information hiding arising.

In modular design, his double dictum of high cohesion within modules and loose coupling between modules is fundamental to modular design in software. However, in Parnas's seminal 1972 paper On the Criteria to Be Used in Decomposing Systems into Modules, this dictum is expressed in terms of information hiding, and the terms cohesion and coupling are not used. He never used them.
Wikipedia: David Parnas

I would define a module as a set of functionality (independent of its representation in a programming language). High cohesion within modules and loose coupling between modules is a defining criterion for a good programmer. What I consider an open question, and what often triggers bugs, is missing documentation for the interface between modules. Often a data structure transfers the data from one module to another. An informal description often triggers different expectations regarding the content of the data structure.

Back to the paper, it illustrates the decomposition of a system by two exemplary modularizations. Whereas the first decomposition was created along the major steps of the processing routines, the second decomposition was created with information hiding in mind. Then several recommendable criteria for decompositions are mentioned:

1. A data structure, its internal linkings, accessing procedures and modifying procedures are part of a single module.
2. The sequence of instructions necessary to call a given routine and the routine itself are part of the same module
3. The formats of control blocks used in queues in operating systems and similar programs must be hidden within a “control block module.”
4. Character codes, alphabetic orderings, and similar data should be hidden in a module for greatest flexibility
5. The sequence in which certain items will be processed should (as far as practical) be hidden within a single module

In the end, I think the paper advocates a clean style which (in some sense and with some limitations) is still true today (e.g. web frameworks heavily limit the way you can define your architecture). I recommend every programmer to reflect about possible decompositions of a system, because the most intuitive one might not be the best. The notion of a flowchart approach being the sequential one, is however awkward and foreign to me.

### PDF/A considered harmful for digital preservation §

Title: “PDF/A considered harmful for digital preservation” by Marco Klindt [url] [dblp]
Published in 2017 at iPRES 2017 and I read it in 2020-12
Abstract: Today, the Portable Document Format (PDF) is the prevalent file format for the exchange of fixed content electronic documents for publication, research, and dissemination work in the academic and cultural heritage domains. Therefore it is not surprising that PDF/A is perceived to be an archival format suitable for digital archiving workflows.

clarification needed

• “PDF reduces the computational burden of the display device by executing the necessary PostScript programs during the creation of the PDF file.”
• PostScript is presented as imposing a computational burden.
• PDF is consequently described as an object store of PostScript elements
• How does this reduce the computational burden?
• I assume that reuse of objects is the answer, but stating so explicitly would be useful
• “Fonts with open licenses like SIL Open Font License 4 circumvent possible restrictions but also complicate conversion due to differences in substitute font dimensions.”
• not an inherent property of OFL fonts?

errors

“The textual markup of Markdown variants is machine actionable while being human friendly to read at the same time. It is suitable for structured texts (including lists and tables) where the exact layout is not as important. Markdown is not well suited for validation.”

• lists are troublesome as nested lists occur in various flavors and were not considered in its initial design
• tables were not part of the first Markdown design and various contradicting implementations exist

quotes

Interesting:

• “In a quick analysis of institutional repositories hosted at the ZIB, the siegfried file identification tool 1 identified 44,114 or 84% from a total of 52,611 documents as PDF (and 1,168 or 0.03% of these as PDF/A). Other file formats included Word, WordPerfect, PostScript files and a long tail of more obscure document formats.”
• “Digital preservation is primarily concerned with keeping information contained in digital objects or documents usable for future use.”
• “The accepted reference model for digital preservation systems is the Open Archival Information System (ISO 14721:2012, OAIS)[11]”
• “The usage of and commercial success began with the release of the free Acrobat Reader 2.0 in 1996 for PDF 1.1 and licensing all patents royalty free for everyone using its format. It became the de-facto exchange format for electronic documents and version 1.7 was finally standardized by the International Standards Organization as ISO 32000-1[15] in 2008.”
• “Until the 10.1.5 and 11.0.01 updates Adobe Acrobat products have historically opened a PDF as long as the %PDF-header started anywhere within the first 1024 bytes of the file.”
• “To extract information from content in PDF, tags can be attached to PDF objects from version 1.4 onward. These tags act as markup to denote the logical structure (semantic elements), and logical order (flow) of the content.”
• “All content shall be marked in the structure tree with semantically appropriate tags (i.e. headings, formulas, paragraphs and such) in the logical, intended reading order.”
• “PDF also does not provide different perspectives on textual content. Electronic documents may want to provide different views of the text or data, either in multiple languages, diplomatic or critical transcriptions, or from different sources.”
• “Usability issues aside, Willinsky et al.[32] give an excellent overview about current issues with using PDF in the scholarly environment. They hope, that their observations will influence further development of PDF or even the ‘Great PDF Replacement Format (GPDFRF)’.”
• “Converting “normal” PDFs to PDF/A a-level conformance automatically is not advisable as a lot of information may already be lost during the creation process of the document.”
• “But even PDF/A a-level conformance may not guarantee full text recovery due to the fact that some tagging features are only recommendations and not mandatory. Hyphenation (the word division at the end of a line) shall be treated as an incidental artifact and be represented as a unicode soft-hyphen (U+00AD) instead of a hard-hyphen (U+002D) as suggested by the standard.”
• “It is alternatively possible to provide the /ActualText attribute without the hyphen.”
• “For some time, the go-to-tool for PDF/A validation was JHOVE 5 using PDF profiles. As it was discovered that it was not suitable for validating PDF/A files[29], the EU funded PREFORMA project 6 included a provision to create veraPDF 7 , a validator which aims at checking conformance of all PDF/A flavors while also allowing for policy checks that are customizable to institutional policy.”
• “Some possible strategies for the better handling of PDFs mostly involve the content producers but also create more involved workflows within the archive:
• Negotiate non-PDF documents better suited for their domain and supported by your archive system.
• Consider using PDF/A as a dissemination format only (and therefore use a PDF rendition server only for access not ingest).
• Save the original source documents alongside the PDFs for full text and structure retention. With PDF/A-3 these could be embedded and linked as source of the document.
• Require data producers to implement workflows that adhere to the Matterhorn protocol to assure fully, meaningful tagged PDFs (including MathML formulas, semantically tagged data and so on) and to provide /ActualText for every textual information contained in the PDF that is not easily extractable otherwise.“
• “WebArchive (WARC) files bundle all necessary components and are already in use in digital archiving.”

Well put:

• “While humans have the ability to recognize the structure of text from layout, which is a necessary requirement for meaningful extraction of information and therefore gaining knowledge from texts and illustrations including diagrams, formulas, and tables, machine-based technology is not yet able to achieve this to the same extent. This makes it difficult for such technology to use or reuse the information contained in PDFs.”
• “Adobe extended the PDF specification multiple times over the years to allow for more features like encryption, transparency, device-independent colors, forms, web-links, javascript, audio, video, 3D objects and many more[18].”
• “PDF supports incremental updates of its content. New objects, a new cross-reference table and a new trailer can be appended to the end of the file, if the content of the PDF is updated, without the need to rewrite the whole file. As objects can be marked as deleted in the xref-table there is no need to delete the corresponding objects in the body section.”
• “PDF can also define different rectangles useful in print like crop boxes, bleed boxes, trim boxes, and art boxes (refer to the PDF reference[7] for additional information).”
• “A standard for required tag usage was published by ISO as ISO 14289[10] known as PDF/UA in 2014 (thus after the publication of PDF/A-2/3). Even though being accessible by AT (i.e. software) is a legal requirement in some domains, creating compliant documents is still a complex and cumbersome endeavor. Even assessing compliance to PDF/UA is quite hard: The Matterhorn protocol[24] provides a testing model that defines 31 checkpoints comprised of 136 failure conditions encompassing file format requirements for AT accessible PDF/UAs of which some are not applicable to PDF/A (e.g. related to javascript). While 87 failure conditions are determinable by software 47 usually require human judgement or assessment. Failure condition 06-003 for example is machine testable and requires the metadata stream to contain a dublincore:title while 06-004 requires that the title clearly identifies the document in respect to human knowledge, a check that obviously is not decidable by algorithms.”
• “Nielsen [23] argued in 2001 that the fixed, page-based layout of  PDF is not well suited for on-screen reading in contrast to web pages or other hypertext documents.”
• “If optical character recognition results are available they also are embedded into the PDF as a invisible text layer over the corresponding areas in the image of the original.”
• “An insightful analogue of the difference between human content understanding and machine extraction capabilities would be the visible communication of music. While storing the layout of sheet music is perfectly achievable with PDF the placement of note glyphs on lines with annotating glyphs for bars, clefs and so on, it is easily understood and transformed into audible sound by humans trained in reading musical notation. A machine would have a hard time extracting enough information to reproduce or compare the musical score.”
• “What constitutes a word and finding word boundaries might be difficult by itself depending on the layout or script of the text. Selecting rows or columns from tables in PDF reader applications often also results in frustration.”
• “Searching for the string ”Rheinland” (German for Rhineland, a part of Germany) in the PDF/A-1a file of the nestor newsletter number 28[22] for example would result in no matches in macOS Preview or Adobe Reader as it is stored as a hard-hyphen. The hyphen in ”Ostwestfalen-Lippe” is a regular one.”
• “PDF/A is perceived to be an archival solution for digital documents. Discussion within the community revealed the reason for that is three-fold: Firstly, it is marketed as an archival format. The A in PDF/A might stand for “Archive” or “Archival” or simply for the letter “A”; I haven’t found any official explanation for the choice of A in the acronym. The second reason may be that it is used by so many institutions to a point where a critical mass is reached. They cannot altogether err in their risk assessment, so the reasoning is that you simply cannot be wrong when you run with the flock. And thirdly, there does not seem to be a better alternative available (see below).”
• “Pages are useful for citation in the traditional format of books or journals but with the advancement of digital publishing and linked data technologies it will be more useful to refer to information sets identified (and locatable) by persistent digital identifiers like URIs or IRIs.”
• “One teenager even wondered why YouTube isn’t mentioned in a book from 2005.”
• “In contrast to XHTML, an XML language, it is very robust to formal errors.”
• “Sullivan reports in her 2003 article (emphasis added): ‘The intent was not to claim that PDF-based solutions are the best way to preserve electronic documents. PDF/A simply defines an archival profile of PDF that is more amenable to long-term preservation than traditional PDF.’[27]”
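
The hyphenation problem quoted above (soft hyphen U+00AD vs. hard hyphen U+002D, and the “Rheinland” search failure) is easy to reproduce outside of any PDF tooling; a minimal sketch in Python:

```python
# Minimal sketch of the hyphenation issue quoted above: a hard hyphen
# (U+002D) baked into extracted text breaks search, while a soft hyphen
# (U+00AD) marks an incidental line break and can be stripped before search.

HARD = "\u002D"  # HYPHEN-MINUS, as stored in the nestor newsletter example
SOFT = "\u00AD"  # SOFT HYPHEN, as recommended by the PDF/A standard

extracted_bad  = f"Rhein{HARD}land"
extracted_good = f"Rhein{SOFT}land"

def normalize_for_search(text: str) -> str:
    # soft hyphens denote incidental word division and may be removed
    return text.replace(SOFT, "")

assert "Rheinland" not in extracted_bad                     # search fails
assert "Rheinland" in normalize_for_search(extracted_good)  # search succeeds
```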

summary

This paper summarizes some shortcomings why/how PDF/A is not the final solution for long-term archival of documents.

PDF features not available in PDF/A (list from external resource):

• embedded video and audio files
• encryption
• transparency
• JavaScript
• executable files
• functions
• layers

I would summarize those arguments as a lack of annotations in the PDF to make PDFs accessible for machines and people with visual disabilities. This makes it difficult to index and retrieve document information as part of a large corpus (→ database). The problems boil down to the design of PDF, which allows multiple ways of encoding text information and might lose e.g. reading order information. Refer to Table 2, for example. In the end, the tooling support is not great, but a lack of alternatives can be identified. However, if we consider information extraction capabilities as more important than e.g. font embedding and reproducible layout, some alternatives are mentioned in section 5.1 (e.g. HTML/CSS or ODF/OOXML). One intriguing analogy is given in the paper: musical typesetting can be done with PDF as output, but retrieving information about the music is very difficult.

In the conclusion the author admits that the PDF/A authors were aware of its shortcomings: “The intent was not to claim that PDF-based solutions are the best way to preserve electronic  documents. PDF/A simply defines an archival profile of PDF that is more amenable to long-term preservation than traditional PDF”

In the end, the paper is a summary which does not provide any solution as pointed out by the author. As a critic of Markdown, I am saddened to see that Markdown was even mentioned (but other markup languages are neglected).

### Piret and Quisquater’s DFA on AES Revisited §

Title: “Piret and Quisquater’s DFA on AES Revisited” by Christophe Giraud, Adrian Thillard [url] [dblp]
Published in 2010 and I read it in 2020-06
Abstract: At CHES 2003, Piret and Quisquater published a very eﬃcient DFA on AES which has served as a basis for many variants published afterwards. In this paper, we revisit P&Q’s DFA on AES and we explain how this attack can be much more eﬃcient than originally claimed. In particular, we show that only 2 (resp. 3) faulty ciphertexts allow an attacker to eﬃciently recover the key in the case of AES-192 (resp. AES-256). Our attack on AES-256 is the most eﬃcient attack on this key length published so far.

quotes

• “we show that only 2 (resp. 3) faulty ciphertexts allow an attacker to efficiently recover the key in the case of AES-192 (resp. AES-256).”
• “Since its publication in 1996, Fault Analysis has become the most efficient way to attack cryptosystems implemented on embedded devices such as smart cards. In October 2000, Rijndael was selected as AES and since then many researchers have studied this algorithm in order to find very efficient differential fault attacks. Amongst the dozen DFA on AES published so far, Piret and Quisquater’s attack published at CHES 2003 [5] is now a reference which has been used as a basis for several variants published afterwards.”
• “Therefore the last round key can be recovered by using 8 faulty ciphertexts with faults induced at chosen locations.”
• “From our experiments, an exhaustive search on AES-128 amongst 2³⁴ key candidates takes about 8 minutes on average on a 4-core 3.2Ghz Xeon by using a non-optimised C code. Therefore such an attack is practical.”
• “The triviality of the extension of Piret and Quisquater’s attack comes from the fact that, since MixColumns is linear, one can rewrite the last two rounds of the AES as depicted in Fig. 3.” → Figure 3 moves MixColumns into the last round and uses key MC-1(Kr-1) for the second-to-last AddRoundKey operation
• “To conclude, the original P&Q’s DFA on AES can uniquely identify the AES key in the 192 and 256-bit cases by using 4 faulty ciphertexts.”
• “From our experiments, an exhaustive search on AES-192 amongst 2⁴² key candidates takes about 1.5 day on average on a 4-core 3.2Ghz Xeon by using a non-optimised C code. Therefore such an attack can be classified as practical.”
• “To conclude, one can reduce the number of candidates for the AES-192 key to 2¹⁰ by using 3 faulty ciphertexts.”
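
The linearity argument behind the Fig. 3 rewrite can be checked directly: since MixColumns is linear over GF(2⁸), MC(x) ⊕ K = MC(x ⊕ MC⁻¹(K)), so AddRoundKey can be pulled through MixColumns using the modified key MC⁻¹(K). A small sketch on a single AES column (my own illustration, with arbitrary column values):

```python
# Demonstrates on one AES column that AddRoundKey commutes with MixColumns
# when the key column is replaced by MC⁻¹(K), which is the trick used to
# rewrite the last two AES rounds in P&Q-style attacks.

def xtime(b: int) -> int:
    # multiply by x in GF(2^8) with the AES reduction polynomial 0x11B
    b <<= 1
    return (b ^ 0x1B) & 0xFF if b & 0x100 else b

def gmul(a: int, b: int) -> int:
    # GF(2^8) multiplication
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        a, b = xtime(a), b >> 1
    return r

def mix_column(col, matrix):
    # matrix-vector product over GF(2^8)
    return [
        gmul(matrix[r][0], col[0]) ^ gmul(matrix[r][1], col[1])
        ^ gmul(matrix[r][2], col[2]) ^ gmul(matrix[r][3], col[3])
        for r in range(4)
    ]

MC     = [[2, 3, 1, 1], [1, 2, 3, 1], [1, 1, 2, 3], [3, 1, 1, 2]]
INV_MC = [[14, 11, 13, 9], [9, 14, 11, 13], [13, 9, 14, 11], [11, 13, 9, 14]]

x = [0x32, 0x88, 0x31, 0xE0]  # arbitrary state column
k = [0x2B, 0x7E, 0x15, 0x16]  # arbitrary round-key column

lhs = [a ^ b for a, b in zip(mix_column(x, MC), k)]  # MC(x) ⊕ K
rhs = mix_column([a ^ b for a, b in zip(x, mix_column(k, INV_MC))], MC)
assert lhs == rhs
```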

summary

Good paper. Ideas are immediate. Not all attacks presented give runtimes, but the (more important) number of faulty ciphertexts is analyzed properly. The original P&Q attack is neat in general.

The original attack from 2003 uniquely identifies the key with 2 faulty ciphertext pairs with probability 98%.

typo

Figure 2 swaps the last SubBytes and ShiftRows operations

### SEVurity: No Security Without Integrity §

Title: “SEVurity: No Security Without Integrity” by Luca Wilke, Jan Wichelmann, Mathias Morbitzer, Thomas Eisenbarth [url] [dblp]
Published in 2020 at IEEE Symposium on Security and Privacy 2020 and I read it in 2020-06
Abstract: One reason for not adopting cloud services is the required trust in the cloud provider: As they control the hypervisor, any data processed in the system is accessible to them. Full memory encryption for Virtual Machines (VM) protects against curious cloud providers as well as otherwise compromised hypervisors. AMD Secure Encrypted Virtualization (SEV) is the most prevalent hardware-based full memory encryption for VMs. Its newest extension, SEV-ES, also protects the entire VM state during context switches, aiming to ensure that the host neither learns anything about the data that is processed inside the VM, nor is able to modify its execution state. Several previous works have analyzed the security of SEV and have shown that, by controlling I/O, it is possible to exﬁltrate data or even gain control over the VM’s execution. In this work, we introduce two new methods that allow us to inject arbitrary code into SEV-ES secured virtual machines. Due to the lack of proper integrity protection, it is sufﬁcient to reuse existing ciphertext to build a high-speed encryption oracle. As a result, our attack no longer depends on control over the I/O, which is needed by prior attacks. As I/O manipulation is highly detectable, our attacks are stealthier. In addition, we reverse-engineer the previously unknown, improved Xor-Encrypt-Xor (XEX) based encryption mode, that AMD is using on updated processors, and show, for the ﬁrst time, how it can be overcome by our new attacks.

Intel SGX's role

Intel Software Guard Extensions (SGX) was the first widely available solution for protecting data in RAM. However, it can only protect a small chunk of RAM, not the VM as a whole.

Memory encryption systems for VMs

• AMD Secure Memory Encryption (SME) (2016): drop-in, AES-based RAM encryption. Controlled by Secure Processor (SP) co-processor. A special bit in the page table – the so-called C-bit – is used to indicate whether a page should be encrypted. The page table in the guest is encrypted and thus not accessible by the hypervisor.
• AMD Secure Encrypted Virtualization (SEV): SEV extends SME for VMs by using different encryption keys per VM, in order to prohibit the hypervisor from inspecting the VM’s main memory
• SEV Encrypted State (SEV-ES)
• Intel Total Memory Encryption (TME)
• Intel Multi-Key Total Memory Encryption (MKTME)

Comparison: S. Mofrad, F. Zhang, S. Lu, and W. Shi, “A comparison study of intel sgx and amd memory encryption technology,”, 2018

Notes

• Why is the tweak value only defined for index ≥ 4? “We denote the first one as t_4, since there are no dedicated constants for the least significant bits 3 to 0.” seems to be a resulting condition, not a reason.
• Is the linear equation system resulting from Dec_K(Enc_N(…), q_j) a linear equation system? XOR with m makes it affine linear, doesn't it? Is bit(i, p) linear? I don't think so unless we are in ℤ_2?!
• See Table 1. They used 4-byte repeating patterns for the tweak constants.
• 0x7413 is “jump if equal” on x86_64, but 0xEB13 is “jump unconditionally”
• page 8: 16 MB = 16 777 216 bytes = 134 217 728 bits. (The computation below left `seconds` undefined; the quoted outputs imply seconds = 37.5.)
seconds = 37.5
bytes = 16 * 1024**2
bits = 8 * bytes
(bytes/seconds) / 1024**2  # MBytes/sec
⇒ 0.4266666666666667
(bits/seconds) / 1024**2  # MBits/sec
⇒ 3.4133333333333336    # ok, this occurs in the paper
(bytes/seconds) / 1024    # KBytes/sec
⇒ 436.9066666666667     # this diverges from the paper with “426.67 KB/s”
• “The first mechanism utilizes the cpuid instruction, which is emulated by the hypervisor and features a 2-byte opcode: Each time cpuid is executed, the hypervisor is called to emulate it.”

On virtual memory with virtual machines

On virtualized systems, two different page tables are used. Within the VM, the VA used by the guest, the Guest Virtual Address (GVA), is translated to the Guest Physical Address (GPA). The GPA is the address which the VM considers to be the PA. However, on the host system itself another page table is introduced, to allow multiple VMs to run on the same physical machine. This second page table is called the Second Level Address Translation (SLAT), or Nested Page Table (NPT) [5]. The NPT translates the GPA into the Host Physical Address (HPA), the actual address of the data in physical memory.
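
The two-level translation described above can be sketched with toy page tables (all page numbers and mappings are made up for illustration):

```python
# Toy sketch of GVA -> GPA -> HPA translation; flat dicts stand in for the
# real multi-level page-table structures, with 4 KiB pages.

PAGE = 4096

guest_pt = {0x1: 0x7}   # guest page table: GVA page 0x1 -> GPA page 0x7
npt      = {0x7: 0x42}  # NPT/SLAT:         GPA page 0x7 -> HPA page 0x42

def translate(addr: int, table: dict) -> int:
    page, offset = divmod(addr, PAGE)
    return table[page] * PAGE + offset

def gva_to_hpa(gva: int) -> int:
    gpa = translate(gva, guest_pt)  # first level, inside the VM
    return translate(gpa, npt)      # second level, controlled by the host

assert gva_to_hpa(0x1234) == 0x42 * PAGE + 0x234
```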

summary

• Point of this paper: AMD SEV lacks integrity protection and thus memory encryption of a virtual machine can be broken by an untrusted hypervisor. Hypervisor can thus modify memory content (i.e. data or instructions) of a VM arbitrarily.
• The researchers determined tweak values T(p) using a linear equation system. The AMD Epyc Embedded processor line was considered. AMD Epyc 7251 (released June 2017) used the XE encryption mode whereas AMD Epyc (released Feb 2018) uses the XEX encryption mode
• Xor-Encrypt (XE): Enc_K(m, p) := AES_K(m ⊕ T(p)) and Dec_K(c, p) := AES⁻¹_K(c) ⊕ T(p)
• Xor-Encrypt-Xor (XEX): Enc_K(m, p) := AES_K(m ⊕ T(p)) ⊕ T(p) and Dec_K(c, p) := AES⁻¹_K(c ⊕ T(p)) ⊕ T(p)
• We never know the key K. Thus we cannot decrypt text and encrypt it again. But since we know T(p) where p is chosen to be physical address bit (i.e. address of a 16-byte block), then moving blocks changes the ciphertext in a predictable way
• XE: Dec_K(Enc_K(m, p), q) = AES⁻¹_K(AES_K(m ⊕ T(p))) ⊕ T(q) = m ⊕ T(p) ⊕ T(q) = m ⊕ T(p ⊕ q)
• When can we make changes in the memory? Whenever control from the VM is handed back from the VM to the hypervisor.
• We can simply remove the executable bit from the page so that a page fault is triggered and control returns to the hypervisor; then we also know the address p. Adding the executable bit again and returning control allows execution to continue stepwise (stepwise execution is a common SGX attack strategy, but unimportant here).
• Or we can inject the cpuid instruction. The hypervisor is responsible for emulating/answering cpuid by copying the return values to a predefined memory segment.
• So we can move pages without screwing up the encryption. But how can we actually modify memory content? We define an encryption oracle:
1. At boot time, cpuid is issued. We flip bits and create a loop to repeatedly call cpuid.
2. The hypervisor provides content as return value of cpuid. The kernel in the VM encrypts it. Thus the hypervisor can read the encrypted version of the plaintext it provided.

Trust in cloud providers

One reason for not adopting cloud services is the required trust in the cloud provider: As they control the hypervisor, any data processed in the system is accessible to them.

Tweakable block ciphers

One popular method for storage encryption are tweakable block ciphers, such as AES-XTS [1], which is, e.g., used in Apple’s FileVault, MS Bitlocker and Android’s file-based encryption. Tweakable block ciphers provide encryption without data expansion as well as some protection against plaintext manipulation. A tweak allows to securely change the behavior of the block cipher, similar to instantiating it with a new key, but with little overhead.

XEX and Xor-Encrypt (XE) are methods to turn a block cipher such as AES into a tweakable blockcipher, where a tweak-derived value is XORed with the plaintext before encryption (and XORed again after encryption in the case of XEX).

typo

“before it continuous its operation” → “before it continues its operation.”

### Software-based Power Side-Channel Attacks on x86 §

Title: “Software-based Power Side-Channel Attacks on x86” by Moritz Lipp, Andreas Kogler, David Oswald, Michael Schwarz, Catherine Easdon, Claudio Canella, Daniel Gruss [url] [dblp]
Published in 2020-11 and I read it in 2020-12
Abstract: Power side-channel attacks exploit variations in power consumption to extract secrets from a device, e.g., cryptographic keys. Prior attacks typically required physical access to the target device and specialized equipment such as probes and a high-resolution oscilloscope.

questions

• “Fig. 1: A histogram of the power consumption of various instructions on the i7-6700K (desktop) system.”
with the same or different operands?
• How does SGX-Step work? Zero-stepping? I don't get Figure 7
• Interpreting Figure 9
• “However, using zero stepping (Section IV-B2) and the possibility to observe the Hamming weight of bytes (Section III-E), masking is insufficient against our attacks on SGX enclaves.”
• “Timing-Independent Covert Channel” which information do you want to transmit?

quotes

• “However, until recently, power analysis attacks had two limitations. First, they primarily targeted small embedded microcontrollers rather than more complex high-performance desktop and server CPUs. Second, software-based attacks relying on the available interfaces were so far not successfully applied on x86 to leak fine-grained information, e.g., cryptographic key bits.”
• “CPA [11] is an extension of DPA, which examines the correlation between variations in the set of traces and a leakage model depending on the value of intermediate values [49].”
• “The Intel Running Average Power Limit (RAPL) mechanism was introduced with the Sandy Bridge microarchitecture to ensure the CPU remains within desired thermal and power constraints [27].”
• “Since Haswell, it has provided three distinct capabilities for controlling average power over timescales of multiple seconds, ~10 ms, and <10 ms (PL1, PL2, and PL3, respectively).”
• “Intel defines four different domains for RAPL: package (PKG), power planes (PP0 and PP1), and DRAM.”
• “Intel generally considers physical side-channel attacks on SGX out of scope. Side channels [9], [73], race conditions, and memory-safety violations are not in the threat model,”
• “Note that while we primarily refer to runtime energy consumption rather than power consumption throughout this work, these are directly related, as power = energy ÷ time.”
• “We performed the experiment on our Intel Xeon E3-1240 v5 (server) system, collecting measurements for all possible byte values for 627 hours.”
• “Moreover, note that Intel RAPL does not provide the energy consumption per core but per processor package. Thus, code executed on other cores have a direct influence on the measurement of a specific piece of code running on one core and, thus, the number of overall measurements increases to average out the noise introduced by the other cores.”
• “the attacker needs to align the recorded traces. The trace needs to contain a distinctive feature, e.g., a distinct peak in power consumption, so that traces can be shifted into alignment with each other. While a privileged attacker can precisely control the victim’s execution and interrupt it at will, an unprivileged attacker cannot. However, if the attacker can control when the execution of the attacked code begins, or use a trigger signal such as a cache-based side channel [72], then the collected traces can be aligned based on that timing information.”
• “We measured over 96 000 execution runs, yielding an overall attack time of 8.11 h on the E3-1275 v5. The result is illustrated in Figure 7.”
• “Incidentally, we note that the key recovery specifically fails for key bytes 0, 4, 8, and 12, i.e., the first byte of each 4-byte word.”
• “However, using zero stepping (Section IV-B2) and the possibility to observe the Hamming weight of bytes (Section III-E), masking is insufficient against our attacks on SGX enclaves.”

summary

The authors present PLATYPUS attacks, novel software-based power side-channel attacks on Intel server, desktop, and laptop CPUs. They exploit unprivileged access to the Intel Running Average Power Limit (RAPL) interface, which exposes values directly correlated with power consumption, forming a low-resolution side channel.

• Targets Linux and Intel CPUs after Sandy Bridge. Might work on AMD and ARM also.
• Voltage package works best, then core package.
• Zero-Stepping is used as delaying technique
• RAPL allows power to be managed in more detail, including monitoring in order to apply cooling appropriately.
• We are not aware of user applications requiring RAPL access. So restriction of access as countermeasure seems reasonable.
• Intel: medium severity
• “Observing Intra-Cacheline Activity” is motivated by Intel's internal symmetric cryptography implementation guidelines
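
For context, the interface in question is exposed on Linux via the powercap sysfs tree. A hedged sketch of reading it (the path is the standard intel-rapl one; access requires an Intel CPU with RAPL and, on patched kernels, root privileges, which is exactly the restriction mentioned above; the wrap-around handling is factored out so it can be checked without hardware):

```python
# Sketch: reading the Linux powercap counter for the RAPL PKG domain.

RAPL_DIR = "/sys/class/powercap/intel-rapl:0"  # package domain, socket 0

def read_energy_uj() -> int:
    with open(f"{RAPL_DIR}/energy_uj") as f:
        return int(f.read())

def read_wrap_uj() -> int:
    with open(f"{RAPL_DIR}/max_energy_range_uj") as f:
        return int(f.read())

def energy_delta_uj(before: int, after: int, wrap: int) -> int:
    # the counter is monotonic modulo `wrap`; handle a single wrap-around
    return after - before if after >= before else after - before + wrap

def measure(fn):
    """Return (result, consumed energy in microjoules) for one call of fn."""
    wrap = read_wrap_uj()
    before = read_energy_uj()
    result = fn()
    return result, energy_delta_uj(before, read_energy_uj(), wrap)
```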

### Templates vs. Stochastic Methods: A Performance Analys… §

Title: “Templates vs. Stochastic Methods: A Performance Analysis for Side Channel Cryptanalysis” by Benedikt Gierlichs, Kerstin Lemke-Rust, Christof Paar [url] [dblp]
Published in 2006 at CHES 2006 and I read it in 2020-06
Abstract: Template Attacks and the Stochastic Model provide advanced methods for side channel cryptanalysis that make use of ‘a-priori’ knowledge gained from a proﬁling step. For a systematic comparison of Template Attacks and the Stochastic Model, we use two sets of measurement data that originate from two diﬀerent microcontrollers and setups. Our main contribution is to capture performance aspects against crucial parameters such as the number of measurements available during proﬁling and classiﬁcation. Moreover, optimization techniques are evaluated for both methods under consideration. Especially for a low number of measurements and noisy samples, the use of a T-Test based algorithm for the choice of relevant instants can lead to signiﬁcant performance gains. As a main result, T-Test based Templates are the method of choice if a high number of samples is available for proﬁling. However, in case of a low number of samples for proﬁling, stochastic methods are an alternative and can reach superior eﬃciency both in terms of proﬁling and classiﬁcation.

ambiguity

• page 2: “a multivariate characterization of the noise” → noise is defined as “noise + performed operation”
• page 3: “for each time instant” → undefined term “time instant”
• page 3: “it is the average m_i² of all available samples” → do we square the average? is it already squared?
• page 3: “(P1, …, Pp)” → what is P?
• page 6: “(II) the number of curves for profiling” → which curves? difference curves? they were only defined for the template method not for the stochastic method

quotes

• “Especially for a low number of measurements and noisy samples, the use of a T-Test based algorithm for the choice of relevant instants can lead to significant performance gains.”
• “However, in case of a low number of samples for profiling, stochastic methods are an alternative and can reach superior efficiency both in terms of profiling and classification.”
• “The underlying working hypothesis for side channel cryptanalysis assumes that computations of a cryptographic device have an impact on instantaneous physical observables in the (immediate) vicinity of the device, e.g., power consumption or electromagnetic radiation”
• approaches depending on number of stages:
• one-stage approach:
1. directly extract key
• two-stage approach:
1. “profiling step”
2. “attack step”
• “Templates were introduced as the strongest side channel attack possible from an information theoretic point of view”
• “This is due to the fact that positive and negative differences between the averages may zeroize, which is desirable to filter noise but hides as well valuable peaks that derive from significant signal differences with alternating algebraic sign.”
• “Templates estimate the data-dependent part ht itself, whereas the Stochastic model approximates the linear part of ht in the chosen vector subspace (e.g., F9) and is not capable of including non-linear parts.”
• “The number of measurements, both during profiling and key extraction, is regarded as the relevant and measurable parameter.”
• “We focus on the number of available samples (side channel quality) since computational complexity is of minor importance for the attacks under consideration.”
• “Hence, a general statement on which attack yields better success rates is not feasible as this depends on the number of curves that are available in the profiling step. If a large number of samples is available (e.g., more than twenty thousand), the Template Attack yields higher success rates. If only a small number of samples is available (e.g., less than twenty thousand), stochastic methods are the better choice.” (w.r.t. Metric 3)
• “The Stochastic Model’s strength is the ability to “learn” quickly from a small number of samples. One weakness lies in the reduced precision due to the linear approximation in a vector subspace.”
• “The Template Attack’s weakness is its poor ability to reduce the noise in the side channel samples if the adversary is bounded in the number of samples in the profiling step.”
• “The T-Test Template Attack is the best possible choice in almost all parameter ranges.”
• “For example, using N = 200 profiling measurements and N3 = 10 curves for classification it still achieves a success rate of 81.7%.”

summary

Important empirical results. Parameters and assumptions are okay. Results are fundamental and significant. However, the description of the methods (templates, stochastic) are quite bad. Looking up the references is required.

Notes:

• sosd: \sum_{i,j=1}^K (m_i - m_j)^2 for i ≥ j
• sost: \sum_{i,j=1}^K \left(\frac{m_i - m_j}{\sqrt{\frac{\sigma_i^2}{n_i} + \frac{\sigma_j^2}{n_j}}}\right)^2 for i ≥ j
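
Restated as code (my own sketch from the two formulas above; m_i, σ_i², n_i are the per-class means, variances, and sample counts at one time instant):

```python
from itertools import combinations
from math import sqrt

def sosd(means):
    # sum of squared pairwise differences of the class means
    return sum((mi - mj) ** 2 for mi, mj in combinations(means, 2))

def sost(means, variances, counts):
    # T-Test based variant: each mean difference is normalized by its
    # estimated standard deviation before squaring
    classes = list(zip(means, variances, counts))
    return sum(
        ((mi - mj) / sqrt(vi / ni + vj / nj)) ** 2
        for (mi, vi, ni), (mj, vj, nj) in combinations(classes, 2)
    )
```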

### The design of a Unicode font §

Title: “The design of a Unicode font” by C Bigelow, K Holmes [url] [dblp]
Published in 1993 and I read it in 2020-10
Abstract: The international scope of computing, digital information interchange, and electronic publishing has created a need for world-wide character encoding standards. Unicode is a comprehensive standard designed to meet such a need. To be readable by humans, character codes require fonts that provide visual images — glyphs — corresponding to the codes. The design of a font developed to provide a portion of the Unicode standard is described and discussed.

ambiguity

“The inclusion of non-alphabetic symbols and non-Latin letters in these 8-bit character sets required font developers to decide whether the assorted symbols and non-Latin letters should be style-specific or generic.”

A definition of style-specific and generic would be nice.

“Hence, although accents need to be clearly differentiated, they do not need to be emphatic, and, indeed, overly tall or heavy accents can be more distracting than helpful to readers.”

What does emphatic mean in this context?

nicely written

“To design such a font is a way to study and appreciate, on a microcosmic scale, the manifold variety of literate culture and history.”

“A ‘roman’ typeface design (perhaps upright would be a less culture-bound term, since a Unicode font is likely to include Greek, Cyrillic, Hebrew, and other scripts)” …

question

“One of the advantages of Unicode is that it includes a Generic Diacritical Marks set of ‘floating’ diacritics that can be combined with arbitrary letters. These are not case-sensitive, i.e. there is only one set of floating diacritics for both capitals and lowercase”

Isn't this a property of the font file format?
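The ‘floating’ diacritics described in the quote survive in today's Unicode as combining marks; a small stdlib-only Python check (my own illustration, not from the paper) shows one combining accent serving both cases at the encoding level, while positioning per case is left to the font:

```python
# The 'floating' diacritics described above exist in Unicode as combining
# marks (U+0300..U+036F). One combining mark serves capitals and lowercase
# alike; it is the font, not the encoding, that must position it per case.
import unicodedata

acute = "\u0301"  # COMBINING ACUTE ACCENT

for base in ("o", "O"):
    seq = base + acute                         # decomposed: letter + floating accent
    composed = unicodedata.normalize("NFC", seq)  # precomposed form, if one exists
    print(base, "->", composed, len(seq), "->", len(composed))
```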

quotes

• “The Unicode standard distinguishes between characters and glyphs in the following way: ‘Characters reside only in the machine, as strings in memory or on disk, in the backing store.”
• “In contrast to characters, glyphs appear on the screen or paper as particular representations of one or more backing store characters. A repertoire of glyphs comprises a font.’”
• “By script or writing system we mean a graphical representation of language.”
• “Accumulated contextual glyphic variations are what transformed capitals into lowercase, for example, and turned roman into italic.”
• “Version 1.0 of the standard encodes approximately 28,000 characters, of which some 3,000 are alphabetic letters and diacritical marks for European, Indic, and Asian languages, 1,000 are various symbols and graphic elements, and 24,000 are ideographic (logographic), syllabic, and phonetic characters used in Chinese, Japanese, and Korean scripts [1,2].”
• “Although the textura typeface used by Gutenberg in the 42-line Bible of 1455-56, the first printed book in Europe, included more than 250 characters, and the humanistica corsiva typeface cut by Francesco Griffo for the Aldine Virgil of 1501, the first book printed in italic, included more than 200 characters, character sets became smaller in later fonts, to reduce the costs of cutting, founding, composing, and distributing type.”
• “Within one font, we wanted the different alphabets and symbols to maintain a single design theme.”
• “A problem with this kind of haphazard development is that the typographic features of documents, programming windows, screen displays, and other text-based images will not be preserved when transported between systems using different fonts.”
• “By ‘harmonization’, we mean that the basic weights and alignments of disparate alphabets are regularized and tuned to work together, so that their inessential differences are minimized, but their essential, meaningful differences preserved.”
• “Latinate typographers”
• “As designers, we would like to know how the ‘rules’ that govern legibility in non-Latin scripts compare to the rules for Latin typefaces. Shimron and Navon, for example, report a significant difference in the spatial distribution of distinctive features in the Roman and Hebrew alphabets [20].”
• “Although many typographic purists believe that simple obliques are inferior to true cursives, Stanley Morison argued in 1926 that the ideal italic companion to roman should be an inclined version of the roman [21].”
• “The European mode of distinction between formal and cursive type forms is not as strong a tradition in some non-Latin scripts, e.g. Hebrew (though there may be other kinds of highly regulated graphical distinctions), so a simple oblique is a more universal, albeit minimal, graphic distinction that can apply equally to all non-Latin scripts.”
• “we concluded that, for improved legibility in international text composition, accents and diacritics should be designed somewhat differently than in the standard version of Lucida Sans.”
• “Accordingly, we designed the lowercase diacritics of Lucida Sans Unicode to be slightly taller and a little different in modulation than those of the original Lucida Sans. Following current practice, we used the lowercase accents to compose accented capitals.”
• “One of the advantages of Unicode is that it includes a Generic Diacritical Marks set of ‘floating’ diacritics that can be combined with arbitrary letters. These are not case-sensitive, i.e. there is only one set of floating diacritics for both capitals and lowercase. In our first version of Lucida Sans Unicode, we implemented these as lowercase diacritics and adjusted their default position to float over the centre of a lowercase o. Ideally, there should be at least two sets of glyphs, one for lowercase and one for upper case (determined automatically by the text line layout manager of the OS or application), along with a set of kerning tables that optimizes the visual appearance of each combination of letter + diacritic.”
• “In a proportional font (like the Times Roman before the eyes of the reader), the advance width of a character is proportional to the geometric logic of its form. A proportionally-spaced m, which has a spatial frequency of three cycles per letter, is wider than an n, which has a frequency of two cycles, which in turn is wider than an i, which has one.”
• “In a fixed pitch font like Courier, all characters are of the same width, so that m is cramped and i extended, and ‘minimum’ has an irregular rhythm, since the spatial frequency of the letters is continually changing within the fixed width of the cells.”
• “Among the alphabetic Unicode character sets, Cyrillic poses an interesting problem in fixed-pitch mode because it has, compared to the Latin, a greater percentage of characters with higher spatial frequency (three or more cycles per letter) on the horizontal axis. The Hebrew alphabet, on the other hand, is more easily transformed to fixed-pitch mode because it has many fewer letters of high spatial frequency.”
• “While the assumption that characters are fully contained within cells was invariably true for primitive terminal and TTY fonts, it is not necessarily true of typographic fonts. In many PostScript and TrueType digital fonts, diacritics on capitals extend above the nominal upper boundary of the body of the font because, in an effort to reduce font file size, capital and lowercase accented characters are built as composites in which letters and diacritics are treated as subroutines,”
• “The lowercase form of most diacritics is taller than the capital form,”
• “A font that would contain all of Unicode 1.0 would be of daunting size, and the standard continues to grow as the Unicode committee adds more characters to it. Even without the Chinese/Japanese/Korean set, the alphabets and symbols comprise almost 4,000 separate characters, sixteen times larger than the usual 8-bit character sets.”
• “To call an incomplete font containing Unicode subsets a ‘Unicode’ font could be misleading, since some users could mistakenly assume that any font called ‘Unicode’ will contain a full set of 28,000 characters.”
• “The Japanese term ‘gothic’ is equivalent to ‘sans serif’, and is also used in English for sans serif faces, particularly those of 19th-century American design, e.g. Franklin Gothic and News Gothic.”
• “To satisfy the French critics and give Times greater appeal in the French market, the Monotype Corporation cut a special version of the face in accord with the dictates of Maximilien Vox, a noted French typographic authority [15,26]. Vox’s re-design of some fourteen characters brought Times slightly closer to the sophisticated style of the French Romain du Roi, cut by Philippe Grandjean circa 1693.”
• “For the German typographic market, Monotype cut a version of Times with lighter capitals that are less distracting in German orthography, where every noun is capitalized”
• “Robert Granjon” … “His ecclesiastical patrons in Rome called him ‘the excellent . . .’, ‘the most extraordinary . . .’, ‘the best ever . . .’ cutter of letters [31].”
• “we followed a traditional Hebrew thick/thin modulation, in which horizontal elements are thicker than vertical – the opposite of the Latin convention – but weighted the Hebrew characters to have visual ‘presence’ equivalent to that of the Latin.”
• “Unicode is a character encoding standard, not a glyph standard.”
• “For example, Unicode treats Latin capital B as one character and lowercase b as another because these are significant differences in Latin orthography.”
• “Unicode does not treat italic b or bold b as separate from b, because those letters are merely allographs, as the linguists would say, the graphic differences not being orthographically significant.”
• “S-cedilla and T-cedilla are used in both Turkish and Romanian, but the cedilla may be rendered as either the French form of cedilla or as a comma-like accent below the letter.”
• “Pike and Thompson [3] discuss the advantage of economical memory management and greater font loading speed when a complete Unicode character set is implemented as many small subfonts from which characters can be accessed independently, and this is the method used in the Plan 9 operating system, both for bitmap fonts and the Lucida Sans Unicode outline fonts. The other method is used in Microsoft Windows NT 3.1, in which the first version of the Lucida Sans Unicode font is implemented as a single TrueType font of 1,740 glyphs. In the current version of Windows NT, this allows a simpler font-handling mechanism, makes the automatic ‘hinting’ of the font easier, since all characters can be analyzed by the hinting software in one pass, and preserves the default design coordination of the subsets, if the font is based on a harmonized set of designs.”

summary

A neat paper. Due to my lack of understanding of the status quo in 1993, I cannot judge the approach and quality. There are some considerations I am not familiar with (e.g. why do we need to distinguish only two kinds of diacritics, lowercase and uppercase), but the paper gives a broad overview of the design decisions that need to be made when designing a ‘Unicode’ font. They designed 1700 characters, but Unicode 1.1 specifies 34,168 characters. It is part of the paper to discuss “One big font vs. many little fonts”.

• “Our initial motivation in designing a Unicode font was to provide a standardized set of glyphs that could be used as a default core font for different operating systems and languages. Within one font, we wanted the different alphabets and symbols to maintain a single design theme.”
• “Having described reasons in favor of creating a Unicode font, we should also discuss arguments against such an undertaking, and various problems we encountered.”
• “Ars longa, vita brevis” (i.e. huge development effort)
• “Culture-bound design” (i.e. typographers only ‘truely’ understand the writing system they use in their culture)
• “Homogenization” (“If it erases distinctive differences between scripts, it increases the possibility of confusion”)
• “Character standard vs. glyph standard”
• “One big font vs. many little fonts”

### Too Much Crypto §

Title: “Too Much Crypto” by Jean-Philippe Aumasson [url] [dblp]
Published in 2020-01-03 at Real-World Crypto 2020 and I read it in 2020/01
Abstract: We show that many symmetric cryptography primitives would not be less safe with significantly fewer rounds. To support this claim, we review the cryptanalysis progress in the last 20 years, examine the reasons behind the current number of rounds, and analyze the risk of doing fewer rounds. Advocating a rational and scientific approach to round numbers selection, we propose revised number of rounds for AES, BLAKE2, ChaCha, and SHA-3, which offer more consistent security margins across primitives and make them much faster, without increasing the security risk.

ambiguity

“Where we examine the reasons behind the number of rounds, comment on the risk posed by quantum computers, and finally propose new primitives for a future where less energy is wasted on computing superfluous rounds.”

Is this an English sentence?

notes

• “Designed in the 1970’s, neither DES nor GOST are practically broken by cryptanalysis.”
• “(We restrict this reassuring outlook to symmetric primitives, and acknowledge that spectacular failures can happen for more sophisticated constructions. An example is characteristic-2 supersingular curves’ fall from 128-bit to 59-bit security [32].)”
• “The speed of symmetric primitives being inversely proportional to their number of rounds, a natural yet understudied question is whether fewer rounds would be sufficient assurance against cryptanalysis’ progress.”
• “We conclude by proposing reduced-round versions of AES, BLAKE2, ChaCha, and SHA-3 that are significantly faster yet as safe as their full-round versions.”
• “in 2009 Bruce Schneier wrote this [51]:”
Cryptography is all about safety margins. If you can break n rounds of a cipher, you design it with 2n or 3n rounds. What we’re learning is that the safety margin of AES is much less than previously believed. And while there is no reason to scrap AES in favor of another algorithm, NIST should increase the number of rounds of all three AES variants. At this point, I suggest AES-128 at 16 rounds, AES-192 at 20 rounds, and AES-256 at 28 rounds. Or maybe even more; we don’t want to be revising the standard again and again.
• “128-bit security is often acknowledged as sufficient for most applications”
• “at the time of writing mining a Bitcoin block requires approximately 2^74 evaluations of SHA-256.”
• “Lloyd [46] estimated that ‘[the] Universe could currently register 10^90 [or 2^299] bits. To register this amount of information requires every degree of freedom of every particle in the Universe’. Applying a more general bound and the holographic principle, Lloyd further calculates that the observable Universe could register approximately 2^399 bits by using all the information capacity matter, energy, and gravity.”
• “Unlike complexity theoretic estimates that use asymptotic notations such as O(n log n) where n is the problem size, cryptanalysts work with fixed-length values and can’t work with asymptotics.”
• “complexities in cryptanalysis papers ignore the fact that a memory access at a random address is typically orders of magnitude slower than simple arithmetic operations.”
• “This example stresses that the area-time (AT) metric model, where the attack cost is viewed as the product between area and time requirements, is more realistic than the model where only time is considered”
• “Grigg and Gutmann called ‘cryptographic numerology’”
• “Using any standard commercial risk management model, cryptosystem failure is orders of magnitude below any other risk.”
• “For example, the greatest risks with e-voting systems are not the cryptographic protocols and key lengths, but the operational and information security concerns.”
• “Our proposed categories are:
• Analyzed: […]
• Attacked: […]
• Wounded: […]
• Broken: […]”
• “Schneier’s law is the tautological-sounding statement ‘Attacks always get better, they never get worse’, which Bruce Schneier popularized, and that (he heard) comes from inside the NSA.”
• “Rarely have number of rounds been challenged as too high. A possible reason (simplifying) is that people competent to constructively question the number of rounds have no incentive to promote faster cryptography, and that those who don’t have the expertise to confidently suggest fewer rounds.”
• “Although Grover reduces key search from O(2^n) to O(2^(n/2)), one shouldn’t ignore the constant factors hiding in the O(). Translating this asymptotic speed-up into a square-root of the actual cost is a gross oversimplification; between constant factors, the size and cost of a quantum circuit implementing the attacked primitive, the lack of parallelism [30], and the latency of the circuit, it’s actually unclear, given today’s quantum computing engineering knowledge, whether Grover would actually be more cost-efficient than classical computers. It’s nonetheless a safe bet to assume that it would be.”
• “Anyway, the number of rounds would not matter much would AES be Groverable, the answer to that question is therefore not important in choosing a number of rounds.”
• “[…] we propose the following:
• AES: 9 rounds instead of 10 for AES-128, 10 instead of 12 for AES-192, 11 instead of 14 for AES-256, yielding respectively a 1.1×, 1.2×, and 1.3× speed-up.
• BLAKE2: 8 rounds instead of 12 for BLAKE2b, 7 rounds instead of 10 for BLAKE2s (we’ll call these versions BLAKE2bf and BLAKE2sf), yielding respectively a 1.5× and 1.4× speed-up.
• ChaCha: 8 rounds instead of 20 (that is, ChaCha8), yielding a 2.5× speed-up.
• SHA-3: 10 rounds instead of 24 (we’ll call this version KitTen, inspired by Keccak family member KangarooTwelve), yielding a 2.4× speed-up.”
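Since the paper's premise (quoted above) is that a primitive's speed is inversely proportional to its number of rounds, the claimed speed-ups follow from the round ratios alone; a quick sanity check:

```python
# Check the claimed speed-ups from round ratios, using the paper's premise
# that speed is inversely proportional to the number of rounds.
proposals = {
    "AES-128": (10, 9),   # (full rounds, proposed rounds)
    "AES-192": (12, 10),
    "AES-256": (14, 11),
    "BLAKE2b": (12, 8),
    "BLAKE2s": (10, 7),
    "ChaCha":  (20, 8),
    "SHA-3":   (24, 10),
}

for name, (full, reduced) in proposals.items():
    print(f"{name}: {full} -> {reduced} rounds, ~{full / reduced:.1f}x speed-up")
```

All seven ratios reproduce the 1.1×, 1.2×, 1.3×, 1.5×, 1.4×, 2.5×, and 2.4× figures from the paper.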

Good paper; its survey is its strong suit. However, the particular choice of proposed parameters has little justification.

typo

If all the best cryptanalysts could find was a distinguisher in the chosen-ciphertext related-key model, then even less likely are practical key recovery attack in the chosen-plaintext model.

typo

But as as noted in §2, numbers such as

### Tweaks and Keys for Block Ciphers: The TWEAKEY Framewo… §

Title: “Tweaks and Keys for Block Ciphers: The TWEAKEY Framework” by Jérémy Jean, Ivica Nikolić, Thomas Peyrin [url] [dblp]
Published in 2014 at ASIACRYPT 2014 and I read it in 2020-06
Abstract: We propose the TWEAKEY framework with goal to unify the design of tweakable block ciphers and of block ciphers resistant to related-key attacks. Our framework is simple, extends the key-alternating construction, and allows to build a primitive with arbitrary tweak and key sizes, given the public round permutation (for instance, the AES round). Increasing the sizes renders the security analysis very difficult and thus we identify a subclass of TWEAKEY, that we name STK, which solves the size issue by the use of finite field multiplications on low hamming weight constants. We give very efficient instances of STK, in particular, a 128-bit tweak/key/state block cipher Deoxys-BC that is the first AES-based ad-hoc tweakable block cipher. At the same time, Deoxys-BC could be seen as a secure alternative to AES-256, which is known to be insecure in the related-key model. As another member of the TWEAKEY framework, we describe Kiasu-BC, which is a very simple and even more efficient tweakable variation of AES-128 when the tweak size is limited to 64 bits.

Properties of tweakable block ciphers:

• formalized in 2002 by Liskov et al.
• tweaks are completely public, keys are not
• retweaking (changing the tweak value) is less costly than changing its secret key
• security model considers that the attacker has full control over both the message and the tweak inputs

quotes (on contributions)

• “We propose the TWEAKEY framework with goal to unify the design of tweakable block ciphers and of block ciphers resistant to related-key attacks.”
• “We give very efficient instances of STK, in particular, a 128-bit tweak/key/state block cipher Deoxys-BC that is the first AES-based ad-hoc tweakable block cipher.”
• “we describe Kiasu-BC, which is a very simple and even more efficient tweakable variation of AES-128 when the tweak size is limited to 64 bits.”

quotes (on history)

• “[…] designs that allowed to prove their security against classical differential or linear attacks have been a very important step forward, […]”
• “The security of the block ciphers, both Feistel and Substitution-Permutation networks, has been well studied when the key is fixed and secret, however, when the attacker is allowed to ask for encryption or decryption with different (and related) keys the situation becomes more complicated.”
• “Most key schedule constructions are ad-hoc, in the sense that the designers came up with a key schedule that is quite different from the internal permutation of the cipher, in a hope that no meaningful structure is created by the interaction of the two components.”
• “This extra input T, later renamed as tweak, was supposed to be completely public and to randomize the instance of the block cipher: to different values of T correspond different and independent families of permutations E_K”
• “This feature was formalized in 2002 by Liskov et al., who showed that tweakable block ciphers are valuable building blocks if retweaking (changing the tweak value) is less costly than changing its secret key.”
• “disk encryption where each block is ciphered with the same key, but the block index is used as tweak value.”
• “Simple constructions of a tweakable block cipher E_K(T, P) based on a block cipher E_K(P), like XORing the tweak into the key input and/or message input, are not satisfactory. For example, only XORing the tweak into the key input would result in an undesirable property that E_K(T, P) = E_{K ⊕ X}(T ⊕ X, P).”
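The undesirable property in the last quote is easy to reproduce with any toy keyed permutation; the 8-bit ‘cipher’ below is purely illustrative, not a construction from the paper:

```python
# Toy demonstration of the quoted weakness: if the tweak is only XORed into
# the key, then E~(K, T, P) = E~(K ^ X, T ^ X, P) for every X, because the
# effective key K ^ T is unchanged. The 'cipher' below is a made-up toy
# permutation on 8-bit values, just to make the algebraic property concrete.
def toy_cipher(key, pt):
    return ((pt ^ key) * 167 + 13) % 256  # affine map with odd multiplier: a bijection

def tweaked(key, tweak, pt):
    return toy_cipher(key ^ tweak, pt)    # tweak XORed into the key input only

K, T, P, X = 0x4B, 0x17, 0xA5, 0x3C
assert tweaked(K, T, P) == tweaked(K ^ X, T ^ X, P)  # related (key, tweak) pairs collide
```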

Non-intuitive results on birthday paradox bounds

• “More importantly, these methods ensure only security up to the birthday-bound (relative to the block cipher size).”
• “Minematsu [46] partially overcomes this limitation by proving beyond birthday-bound security for his design, but at the expense of a very reduced efficiency.”
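For context, the birthday bound referred to here is the usual q(q−1)/2^(n+1) collision term; a minimal sketch (my own illustration) of why an n-bit block gives only about 2^(n/2) blocks of security:

```python
# The 'birthday bound' referenced above: among q random n-bit values, the
# collision probability grows like q*(q-1)/2^(n+1), so a distinguishing
# collision is expected after roughly 2^(n/2) blocks.
def collision_bound(q, n):
    """Upper bound on collision probability for q random n-bit values."""
    return q * (q - 1) / 2 ** (n + 1)

n = 128
print(collision_bound(2 ** 32, n))  # ~2^-65: negligible
print(collision_bound(2 ** 64, n))  # ~0.5: the birthday bound is reached
```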

Future work

• “As of today, it remains an open problem to design an ad-hoc AES-like tweakable block cipher, which in fact would be very valuable for authenticated encryption as AES-NI instruction sets guarantee extremely fast software implementations.”

Tweakey

• “we emphasize that not all TWEAKEY instances are secure”
• “E is a key-alternating cipher when the general form f(s_i, K_i) = s_{i+1} for i < r becomes f(s_i ⊕ K_i) = s_{i+1}”
• “The signature of standard block ciphers can be described as E: {0, 1}^k × {0, 1}^n → {0, 1}^n where an n-bit plaintext P is transformed into an n-bit ciphertext C = E(K, P) using a k-bit key K.”
• “The signature for a tweakable block cipher therefore becomes E: {0, 1}^k × {0, 1}^t × {0, 1}^n → {0, 1}^n, the ciphertext C = E(K, T, P), where the tweak T does not need to be secret and thus can be placed in the public domain.”
• “It is important to note that the security model considers that the attacker has full control over both the message and the tweak inputs.”
• related-tweakey := related-key related-tweak
• open-tweakey := open-key open-tweak
• Figure 3 shows the TWEAKEY design, where the top wires transmit t+k bits, g outputs n bits and the bottom wires transmit n bits
• subtweakey extraction function g
• internal state update permutation f
• tweakey state update function h
• “This can be summarized as: s_{i+1} = f(s_i ⊕ g(tk_i)) followed by tk_{i+1} = h(tk_i)”
• “The functions f, g and h must be chosen along with the number of rounds r such that no known attack can apply on the resulting primitives.”
• “One of the main causes for the low number of ad-hoc tweakable block ciphers is the fact that adding a tweak input makes the security analysis much harder.”
• “The trick we use is to apply a nibble-wise multiplication with a distinct coefficient α^j for all tweakey words.”
• “when we deal with differences in several tweakey words (which is supposedly very hard to analyze due to the important number of nibbles), the study of the STK construction is again the same as for a classical TK-1 analysis, except that at most p − 1 active output nibbles can be erased in each subgroup.”
• Figure 4 shows the STK construction
• “The chosen round functions (and the nibble sizes), suggest that Deoxys-BC is software oriented, while Joltik-BC is hardware (and lightweight) oriented design.”
• “For instance, XORing two columns (instead of rows) would immediately lead to an insecure variant.” (context Kiasu-BC)
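The f/g/h recurrence collected in the bullets above can be sketched generically; the placeholder functions below are trivial stand-ins of my own, not the Deoxys-BC or Joltik-BC instantiations:

```python
# Generic TWEAKEY round structure from the notes above:
#   s_{i+1} = f(s_i XOR g(tk_i)),  tk_{i+1} = h(tk_i)
# f, g, h below are toy stand-ins (NOT a real cipher), only to show the data
# flow: the (t+k)-bit tweakey state evolves in parallel with the n-bit state.
MASK64 = (1 << 64) - 1     # toy n = 64-bit internal state
MASK128 = (1 << 128) - 1   # toy t + k = 128-bit tweakey state

def f(s):    # internal state update permutation (placeholder: 64-bit rotation)
    return ((s << 1) | (s >> 63)) & MASK64

def g(tk):   # subtweakey extraction function (placeholder: fold t+k bits to n)
    return (tk ^ (tk >> 64)) & MASK64

def h(tk):   # tweakey state update function (placeholder: 128-bit rotation)
    return ((tk << 7) | (tk >> 121)) & MASK128

def tweakey_encrypt(state, tweakey, rounds=8):
    s, tk = state, tweakey
    for _ in range(rounds):
        s = f(s ^ g(tk))   # s_{i+1} = f(s_i ^ g(tk_i))
        tk = h(tk)         # tk_{i+1} = h(tk_i)
    return s ^ g(tk)       # final subtweakey addition, as in key-alternating ciphers

print(hex(tweakey_encrypt(0x0123456789ABCDEF, (0xFEDC << 112) | 0xBA98)))
```

Since f is a rotation and the XORs are invertible, the map is a permutation of the state for any fixed tweakey, mirroring the key-alternating structure the paper extends.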

Performance and area

• “a complete 128-bit tweak 128-bit key 128-bit block cipher proposal Deoxys-BC based on the AES round function, but faster and more lightweight than other tentatives to build a tweakable block cipher from AES-128. When used in ΘCB3 [38] authenticated encryption, Deoxys-BC runs at about 1.3 c/B on the latest Intel processors. This has to be compared to OCB3, which runs at 0.7-0.88 c/B when instantiated with AES-128, but only ensures birthday-bound security. Alternatively, Deoxys-BC could be a replacement for AES-256, which has related-key issues as shown in [8].”
• “On longer inputs and modes based on parallelizable block cipher calls (such as ΘCB3), Deoxys-BC-256 runs at around 1.3 cycles per byte, while Deoxys-BC-384 at around 1.55 cycles per byte. This is to be compared to AES in OCB3 mode, which runs at around 0.70 - 0.88 cycles per byte (but has only birthday bound security).”
• “Therefore, we estimate that the entire Deoxys-BC-256 can be implemented with around 3400 GE, and Deoxys-BC-384 with around 4400 GE.”

summary

I think this paper is a good piece of research. I am not an expert on symmetric cryptography and cannot judge the security analysis and possible attacks, but to me it seems to consider the relevant properties. Unrelated to the paper, I was not aware of beyond-birthday-bound security, which totally intrigued me. Related to the paper, follow-up work could address the question “What are sufficient conditions to achieve a secure tweakable block cipher with TWEAKEY?”. Well done!

• TWEAKEY framework
• Kiasu-BC (tweak size = 64 bits, AES-128 based)
• STK (goal: ease of cryptanalysis with existing tools)
• Deoxys-BC (n=128 bit blocks input, f = AES round function)
• Deoxys-BC-256
• Deoxys-BC-384
• Joltik-BC (n = 64 bits, f = AES-like using 4-bit nibbles)
• Joltik-BC-128 (r = 24 rounds)
• Joltik-BC-192 (r = 32)
• “The QARMA Block Cipher Family” by Roberto Avanzi is a follow-up on this work

typo

• “these scheme might not be really efficient” → “these schemes might not be really efficient”
• “This might be seen as counter intuitive as it is required the tweak input to be somehow more efficient than the key input,” → “This might be seen as counter intuitive as it requires the tweak input to be somehow more efficient than the key input,”
• “but at the same time the security requirement on the tweak seem somehow stronger than on the key,” → “but at the same time the security requirement on the tweak seems somehow stronger than on the key,”
• “and then multiply each c-bit cell of the j-th” → and then multiplies each c-bit cell of the j-th
• “Most automated differential analysis tools for AES-like ciphers (e.g., [9, 23, 28]) use truncated differential representation to make feasible the search for differential characteristics.” → “Most automated differential analysis tools for AES-like ciphers (e.g., [9, 23, 28]) use truncated differential representation to make the search for differential characteristics feasible.”

### When a Patch is Not Enough - HardFails: Software-Explo… §

Title: “When a Patch is Not Enough - HardFails: Software-Exploitable Hardware Bugs” by Ghada Dessouky, David Gens, Patrick Haney, Garrett Persyn, Arun Kanuparthi, Hareesh Khattri, Jason M. Fung, Ahmad-Reza Sadeghi, Jeyavijayan Rajendran [url] [dblp]
Published in 2018-12-01 and I read it in 2020-07
Abstract: Modern computer systems are becoming faster, more efficient, and increasingly interconnected with each generation. Consequently, these platforms also grow more complex, with continuously new features introducing the possibility of new bugs. Hence, the semiconductor industry employs a combination of different verification techniques to ensure the security of System-on-Chip (SoC) designs during the development life cycle. However, a growing number of increasingly sophisticated attacks are starting to leverage cross-layer bugs by exploiting subtle interactions between hardware and software, as recently demonstrated through a series of real-world exploits with significant security impact that affected all major hardware vendors.

HCF

“A behavior humorously hinted at in IBM System/360 machines in the form of a Halt-and-Catch-Fire (HCF) instruction.”

→ See also Christopher Domas' research on x86 since 2017
→ Pentium F00F (C7C8) bug

quotes

• “This approach does not ensure security at the hardware implementation level. Hardware vulnerabilities can be introduced due to: (a) incorrect or ambiguous security specifications, (b) incorrect design, (c) faulty implementation of the design, or (d) a combination thereof.”
• “To detect such bugs, the semiconductor industry makes extensive use of a variety of verification and analysis techniques, such as simulation and emulation (also called dynamic verification)”
• “industry-standard tools include Incisive, Solidify, Questa Simulation and Questa Formal, OneSpin 360, and JasperGold”
• “This process incorporates a combination of many different techniques and toolsets such as RTL manual code audits, assertion-based testing, dynamic simulation, and automated security verification.”
• “recent outbreak of cross-layer bugs” with 15 references appended ^^
• “To reproduce this effect, we implemented the list of bugs using two popular and freely available processor designs for the widely used open-source RISC-V architecture.”
• “Specifically, we observe that RTL bugs arising from complex and cross-modular interactions in real-world SoCs render RTL bugs extremely difficult to detect in practice. Further, it may often be feasible to exploit them from software to compromise the entire platform, and we call such bugs HardFails.”
• “As all vendors keep their proprietary industry designs and implementations inaccessible, we use the popular open-source RISC-V architecture and hardware micro-architecture as a baseline”
• “We investigated how these vulnerabilities can be effectively detected using formal verification techniques (Section V) using an industry-standard tool and in a second case study through simulation and manual RTL analysis (Section VI).”
• “As a result, real-world SoCs can easily approach 100,000 lines of RTL code, and some open-source designs significantly outgrow this to many millions lines of code”
• “However, since RTL code is usually compiled and hardwired as integrated circuitry logic, the underlying bugs will remain and cannot, in principle, be patched after production. This is why RTL bugs pose a severe security threat in practice.”
• “We call these the HardFail properties of a bug:”
• “Cross-modular effects (HF-1).” […]
• “Timing-flow gap (HF-2).” […] “In practice, this leads to vast sources of information leakage due to software-exploitable timing channels (see Section IX).” […]
• “Cache-state gap (HF-3).” […] “In particular, current tools reason about the architectural state of a processor by exclusively focusing on the state of registers. However, this definition of the architectural state completely discards that modern processors feature a highly complex microarchitecture and diverse hierarchy of non-register caches. This problem is amplified as these caches have multiple levels and shared across multiple privilege levels. Caches represent a state that is influenced directly or indirectly by many control-path signals.” […]
• “Hardware/firmware interactions (HF-4).” […] “Hence, reasoning on whether an RTL bug exists is inconclusive when considering the hardware RTL in isolation.” […]
• “On analyzing the RTL of Ariane, we observed that TLB page faults due to illegal accesses occur in a different number of clock cycles than page faults that occur due to unmapped memory (we contacted the developers and they acknowledged the vulnerability).”
• “Once the instruction is retired, the execution mode of the core is changed to the unprivileged level, but the entries that were prefetched into the cache (at the system privilege level) do not get flushed.”
• “We emphasize that in a real-world security testing (see Section II), engineers will not have prior knowledge of the specific vulnerabilities they are trying to find. Our goal, however, is to investigate how an industry-standard tool can detect RTL bugs that we deliberately inject in an open-source SoC and have prior knowledge of (see Table I).”
• “Our results in this study are based on two formal techniques: Formal Property Verification (FPV) and Security Path Verification (SPV).”
• “To describe our assertions correctly, we examined the location of each bug in the RTL and how it is manifested in the behavior of the surrounding logic and input/output relationships. Once we specified the security properties using assert, assume and cover statements, we determined which RTL modules we need to model to prove these assertions.”
• “Out of the 31 bugs we investigated, shown in Table I, using the formal verification techniques described above, only 15 or 48%, were detected. While we tried to detect all 31 bugs formally, we were only able to formulate security properties for only 17 bugs. This indicates that the main challenge with using formal verification tools is identifying and expressing security properties that the tools are capable of capturing and checking.”
• “Our results, shown in the SPV and FPV bars of Figure 3, indicate that integer overflow and address overlap bugs had the best detection rates, 80% and 100%, respectively.”
• “The implications of these findings are especially grave for real-world more complex SoC designs where these bug classes are highly relevant from a security standpoint.”
• “we replaced the PULP_SECURE variable, which controls access privileges to the registers, with the PULP_SEC variable.”
• “We present next the results of our second case study. 54 teams of researchers participated in Hack@DAC 2018, a recently conducted capture-the-flag competition to identify hardware bugs that were injected deliberately in real-world open-source SoC designs. This is the equivalent of bug bounty programs that semiconductor companies offer”
• “The goal is to investigate how well these bugs can be detected through dynamic verification and manual RTL audit without prior knowledge of the bugs.”
• “This RTL vulnerability manifests in the hardware behaving in the following way. When an error signal is generated on the memory bus while the underlining logic is still handling an outstanding transaction, the next signal to be handled will instead be considered operational by the module unconditionally.”
• “While existing industry SoCs support hot-fixes by microcode patching, this approach is inherently limited to a handful of changes to the instruction set architecture, e.g., modifying the interface of individual complex instructions and adding or removing instructions. Thus, such patches at this higher abstraction level in the firmware only act as a "symptomatic" fix that circumvent the RTL bug.”
• “VeriCoq based on the Coq proof assistant transforms the Verilog code that describes the hardware design into proof-carrying code.”
• “Finally, computational scalability to verifying real-world complex SoCs remains an issue given that the proof verification for a single AES core requires 30 minutes to complete”
• “Murφ model”
• “Information flow analysis (such as SPV) are better suited for this purpose where a data variable or input is assigned a security label (or a taint), and the taint propagation is monitored.”
• “IFT techniques are proposed at different levels of abstraction: gate-, RT, and language-levels.”
• “At the language level, Caisson and Sapper are security-aware HDLs that use a typing system where the designer assigns security “labels” to each variable (wire or register) by the security policies required. However, they both require redesigning the RTL using a new hardware description language which is not practical. SecVerilog [33, 100] overcomes this by extending the Verilog language with a dynamic security type system.”
• “In the Meltdown attack, speculative execution can be exploited on modern processors (affecting all main vendors) to completely bypass all memory access restrictions.”
• “a selection multiplexer to select between AES, SHA1, MD5, and the temperature sensor.”
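The taint-propagation idea behind SPV and the other IFT techniques quoted above can be sketched in a few lines. This is a toy model in Python, not the paper's method: real gate- or RT-level IFT operates on hardware netlists, and all names here (`Tainted`, `combine`, `check_sink`) are hypothetical.

```python
# Toy illustration of information-flow tracking (IFT): every value carries
# a security label ("taint"), any operation propagates the taint, and a
# security property is checked by inspecting labels at a sink.

from dataclasses import dataclass


@dataclass(frozen=True)
class Tainted:
    value: int
    tainted: bool  # True = derived from secret or untrusted input


def combine(a: Tainted, b: Tainted, op) -> Tainted:
    # Propagation rule: the result is tainted if any operand is.
    return Tainted(op(a.value, b.value), a.tainted or b.tainted)


def check_sink(v: Tainted) -> bool:
    # Property: "no tainted value reaches an unprivileged output port".
    return not v.tainted


secret = Tainted(0x2A, tainted=True)   # e.g. a key byte
public = Tainted(0x10, tainted=False)  # attacker-visible constant

mixed = combine(secret, public, lambda x, y: x ^ y)

print(check_sink(public))  # True: safe to expose
print(check_sink(mixed))   # False: would leak secret-derived data
```

The same lattice-of-labels idea underlies the security-typed HDLs mentioned above (Caisson, Sapper, SecVerilog), where the label is part of each wire's or register's type rather than a runtime field.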

summary

A technical report discussing how bugs are introduced in the hardware design process and slip through testing tools. Specifically, they define HardFails as RTL bugs that are difficult to detect and can be triggered from software, potentially compromising the entire platform. Their classification into four HardFail properties {Cross-modular effects, Timing-flow gap, Cache-state gap, Hardware/firmware interactions} is non-exhaustive (as pointed out in the conclusion). In my opinion, the paper is too copious when discussing the hardware development process and laying out the advantages/disadvantages of various tools. I think it could have been more concise (e.g. in “Proof assistant and theorem-proving” the scalability issue is mentioned twice).
Besides that, I think it gives a nice overview of the issues hardware design has to deal with, and yes, we need better tool support. But the paper (obviously) offers no solution to the mentioned problems.
Catherine pointed out that, in Table 2, CLKScrew does not need firmware interaction and Foreshadow has nothing to do with firmware interaction.