Contact tracing and the MIT Safepaths project

✍️ Written on 2020-04-04 in 1795 words. Part of cs IT-security privacy

Globally researchers and developers are searching for solutions of digital contact tracing. The usecase is to register all contacts you meet. Once, infection of the disease has been determined, all contacts can be notified and isolated to prevent spread of the disease. The urgent usecase is obviously the Covid-19 pandemic.

I work at an institute specializing in cryptography and software verification. Obviously many researchers at our institute care about the privacy of user’s data. Several global projects raised privacy concerns. In particular, China and Russia use mass surveillance techniques to track citizens. I am glad we don’t have such technology in place in Austria, but still some initiatives caused privacy concerns. In particular, mobile carrier A1 shared results from a motion analysis application developed by Invenium (via finance.yahoo.com). Furthermore, national broadcast agency ORF gave the Red Cross humanitarian organization an opportunity to promote the newly-developed Stopp Corona App.

As researchers, we met online for the second time to discuss solutions and I volunteered to look at MIT’s Safepaths project. Since this topic rised to public attention, I wanted to share my notes.

Privacy? Why?

Incidents mentioned in the whitepaper:

  • “South Korea sends text text messages containing personal information about diagnosed carriers to inform citizens.” (cf. page 5)

  • “In the US, Nebraska and Iowa published information of where diagnosed carriers have been through media outlets and government websites.” (cf. page 5)

  • “[Some of my patients] were more afraid of being blamed than dying of the virus” by Lee Su-young, South Korea (cf. page 7)

  • “In China, users suspect an app developed to help citizens identify symptoms and their risk of carrying a pathogen spies on them and reports personal data to the police. The Google Play store also pulled the Iranian government’s app amid similar fear and South Korea’s app to track those in self-quarantine automatically notifies the user’s case worker if they leave their quarantine zone.” (cf. page 7)

  • “In South Korea, fraudsters quickly began blackmailing local merchants and demanding ransoms to not (falsely) report themselves as sick and having visited the business.” (cf. page 9)

  • “Hackers have successfully infiltrated apps and services collecting sensitive information before, with 92 million accounts from the genealogy and DNA testing service MyHeritage hacked in 2017.” (cf. page 9)

Incidents mentioned in the paper:

  • “Israel approved cellphone tracking for coronavirus patients.” (cf. page 2, footnote)

  • “Israel declared a state of emergency during its 1948 War of Independence, justifying a range of “temporary” measures that removed individual freedoms. They won the War of Independence but never declared their state of emergency over, and many of the “temporary” measures are ongoing. Israel recently approved cellphone tracking of its COVID-19 patients. Similarly COVID-19 surveillance innovations in China are likely to be used by China’s counterterrorism forces beyond the pandemic to further monitor and regulate the movement of its people.” (cf. page 6)

An initial idea

Initially, I had the following idea based on hash algorithms:

Let H be a hash function. Let t be a timestamp unique for any 5 minutes of any day. Let l be a location identifier using GPS as source for one area of about 5m of the rasterized globe. During the day, the user’s smartphone computes H(t ‖ l) every 5 minutes and stores all 288 hashes of a day. If one person gets infected, all hashes are published. Every user’s smartphone determines whether any hash matches. If so, then (with very high probability) both people spent time at the same place and time. Thus, the user shall be isolated as well.

The point of this approach is that users can tell when they met without revealing time and location to everyone. This of course assumes preimage resistance of hash algorithms as commonly found among cryptographic hash functions.

One friend pointed out that the borders of rasterized locations are a problem. Indeed. Two people, 1 meter apart, might be in different areas because the border of two areas cross in between. This problem can be circumvented, in my opinion. Consider that we take the same rasterization a second time shifted by 2.5m. Now, we compute H(t ‖ l₁) and H(t ‖ l₂) instead for rasters 1 and 2 respectively and store both hashes. Now if we find a match in any of these hashes, we consider this as encounter. The border issue occurs with lower probability. Be aware that this concept can be scaled to any granularity.

However, Christian also pointed out another issue: The entropy for the hash function input is too low. In my contrived encoding, we could consider a longitude, latitude and timestamp. Longitude and latitude must be in domain 0, 180) with ([4 or 5 decimal places. Thus we need to encode 18,000,000 different states (with 5 decimal places). This fits into two times 24 bit entropy. Finally, the timestamp has 288 values per day (i.e. 288 5-min intervals). Let us restrict ourselves to 32 years. 288·365·32 (neglecting leap years) = 3,363,840 states. This fits into 22 bits of entropy. This sums up to 70 bits at most. With additional information about the place of living or the date, a brute-force attack might become feasible even without dedicated hardware. Be aware, that 5 bits in this system already get lost, if you just know the year.

The MIT Safepaths project

Then I looked into the MIT Safe Paths project. I want to give a summary:

At page 7 of the paper, you can find a discussion of the rasterization of the geographic space. Two projects are mentioned:

  • Geohash is a public domain geocode system invented in 2008 by Gustavo Niemeyer[1] and (similar work in 1966) G.M. Morton[2], which encodes a geographic location into a short string of letters and digits. It is a hierarchical spatial data structure which subdivides space into buckets of grid shape, which is one of the many applications of what is known as a Z-order curve, and generally space-filling curves. (via Wikipedia)

  • H3 is a hexagonal hierarchical geospatial indexing system. H3 has an easy API for indexing coordinates into a hexagonal, global grid. Easy, bitwise truncation to coarser, approximate cells, along with area compression/decompression algorithms. Along with twelve pentagons, the entire world is addressable in H3, down to square meter resolution. (via H3 homepage)

At page 8 of the paper, the border problem stated above is mentioned. The authors suggest to store adjacent areas as well, increasing the storage requirements depending of the shape of the grid.

At page 9 of the paper illustrates the protocol. They take the same approach hashing all location trail triples (longitude, latitude, timestamp), but additionally establish a Diffie-Hellman protocol between the app and the publication server. Thus, modulus p and primitive root g are public knowledge. The app chooses group element a and the server chooses group element b. The app shares H(t ‖ l)ᵃ and the server shares H(t ‖ l)ᵇ. The server then shares H(t ‖ l)ᵃᵇ. The app can compute H(t ‖ l)ᵇᵃ and because H(t ‖ l)ᵃᵇ = H(t ‖ l)ᵇᵃ, matching location trails can be found by comparing the list of location trails.

On page 10 an interesting remark is made: “The server can limit the exchanges with any client to N points per exchange, and limit the number of queries per day. This limitation is important for preventing privacy attacks where an adversary might attempt querying over the entire spatiotemporal grid to reconstruct the location histories of diagnosed carriers.” This in particular, in my opinion, makes the threat of low entropy negligible for this approach. However, the approach puts computational burden on the server. Consider the amount of traffic if 5 million clients connect to the server every 5 minutes (be aware that the data used is made up and not related to the MIT project).

Conclusion

I personally believe that privacy-preserving contact tracing can be implemented. The developers of Stopp Corona App did a good job, however, I still see issues. I focused (as often) on technical details, but usability will be crucial for broad adoption in a population. In general, I believe MIT Safepaths and PEPP-PT are candidates for preserving-preserving contact tracing, but we are waiting for final products.

Resources

Papers / Research article:

News paper articles:

Security / privacy evaluation:

Opinion articles:

White papers:

Projects, implementations and initiatives: