Update 2022-03-07: add total size information
Motivation
If you want to write software working with Teχ files, you need a large corpus of files to test against. An obvious choice is arxiv.org, a preprint server which allows authors to also upload their pdftex files.
arxiv documents how to access their files in the AWS S3 requester-pays bucket. So let’s do this!
AWS setup
Prerequisites: you need a credit card which allows $1 verification transactions. I have no idea which cards support this process; my VISA card did not work. Furthermore, you need to add and verify a phone number.
- Create an Amazon account on amazon.com.
- Go to AWS S3 and log in with your Amazon account as root user (we will create an IAM user later).
- Go to the billing settings (also accessible by clicking on your name in the top-right corner and then “Billing Dashboard”). Add the credit card, fill in all the requested data and then pass the verification process.
- Go to the security credentials (also accessible by clicking on your name in the top-right corner and then “Security credentials”).
- Go to “Access management” / Users / “Add User”. Pick a username for your new IAM user and select “Access key - Programmatic access”. There is no need for groups or tags. Confirm your choices; we will add permissions in the next step.
- Again go to “Access management” / Users. Choose your user, switch to the “Permissions” tab and add an inline policy. Instead of the visual editor, you can choose JSON input and add the following (these permissions are very relaxed; you can tighten them as needed) …
  { "Version": "2012-10-17", "Statement": [{ "Effect": "Allow", "Action": ["s3:*"], "Resource": "*" }] }
- Go to “Access management” / Users. Choose your user, switch to the “Security credentials” tab and “Create an access key”. Note down the “Access key ID” and the “Secret access key”.
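The inline policy above allows every S3 action on every resource. A narrower variant can be sketched as follows. This is a minimal sketch and not something I tested against the arxiv bucket: it assumes the bucket name `arxiv` from the s3cmd command in this post, and it assumes that listing and reading objects (`s3:ListBucket` and `s3:GetObject`) are all the download needs. The function name `read_only_policy` is made up for illustration. Since the connection test of `s3cmd --configure` tries to list all buckets, that test will probably fail with this policy unless you also allow `s3:ListAllMyBuckets`.

```python
import json

# Assumption: the bucket is called "arxiv", as in s3://arxiv/src/.
BUCKET = "arxiv"


def read_only_policy(bucket):
    """Build an IAM policy document that only allows listing and reading one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing keys is a bucket-level action, so the resource is the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Reading objects is an object-level action, hence the trailing /*.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


policy_json = json.dumps(read_only_policy(BUCKET), indent=2)
print(policy_json)
```

The printed JSON can be pasted into the JSON input of the inline policy editor in place of the relaxed policy.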
s3cmd
We are going to use s3cmd. You can install it via pip:
- python3 -m venv venv; source venv/bin/activate; pip install s3cmd
- Run
  s3cmd --configure
  to configure your connection settings. Here, you will need the two keys of the IAM user. I preferred the HTTP-only version. Before it finishes, it will run a test and should report that the test connection worked out fine.
- Run
  s3cmd get --recursive --skip-existing s3://arxiv/src/ --requester-pays
  to download all source files from arxiv.org and skip already-downloaded files. I got the command from a Stack Overflow thread.
If the connection test fails with an error message like this …
New settings:
Access Key: MYSECRETACCESSKEY
Secret Key: MYSECRETSECRETKEY
Default Region: AT
S3 Endpoint: s3.amazonaws.com
DNS-style bucket+hostname:port template for accessing a bucket: %(bucket)s.s3.amazonaws.com
Encryption password: MYSECRETPASSWORD
Path to GPG program: /usr/bin/gpg
Use HTTPS protocol: False
HTTP Proxy server name:
HTTP Proxy server port: 0
Test access with supplied credentials? [Y/n] Y
Please wait, attempting to list all buckets...
ERROR: Test failed: 403 (NotSignedUp): Your account is not signed up for the S3 service. You must sign up before you can use S3.
… your account is not properly set up (the mobile phone or credit card is not yet verified).
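Because `--skip-existing` lets you resume an interrupted download, it helps to check how much has already arrived on disk. The following is a minimal sketch using only the Python standard library. The function name `summarize_downloads` and its `pattern` parameter are made up for illustration, and the script assumes the files were downloaded into the current directory. If your virtual environment lives in the same directory, narrow the pattern or point the function at a dedicated download directory.

```python
from pathlib import Path


def summarize_downloads(directory, pattern="*"):
    """Return (number of files, total size in GB) for all files below `directory`."""
    # rglob walks the directory recursively; directories themselves are skipped.
    files = [p for p in Path(directory).rglob(pattern) if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    # Decimal gigabytes (1 GB = 10^9 bytes).
    return len(files), total_bytes / 1e9


if __name__ == "__main__":
    count, size_gb = summarize_downloads(".")
    print(f"{count} files, {size_gb:.1f} GB downloaded so far")
```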
There are 4699 chunks of ~500 MB source files as of 2022-03-06 (i.e. 2.3 TB). I only downloaded 2350 chunks (i.e. 50%), which amounts to 1219 GB in total. Each chunk has a size between 1.7 MB and 1.8 GB. Downloading one chunk in about one minute on average gave me a 38h download. Don’t waste your time finding out how much the download will cost you; I already tried, unsuccessfully. I will update this post with costs once I receive my invoice.
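The size and time figures can be cross-checked with a few lines of arithmetic. This is only a sketch built from the numbers in this post. It assumes decimal units (1 GB = 1000 MB) and that the chunks I did not download are, on average, as large as the ones I did.

```python
# Figures reported in this post (as of 2022-03-06).
TOTAL_CHUNKS = 4699
NOMINAL_CHUNK_MB = 500
DOWNLOADED_CHUNKS = 2350
DOWNLOADED_GB = 1219
MINUTES_PER_CHUNK = 1.0  # rough average, not a precise measurement

# Nominal total size if every chunk had exactly 500 MB.
nominal_total_tb = TOTAL_CHUNKS * NOMINAL_CHUNK_MB / 1e6

# Average chunk size actually observed in the downloaded half.
avg_chunk_mb = DOWNLOADED_GB * 1000 / DOWNLOADED_CHUNKS

# Assumption: the remaining chunks have the same average size.
extrapolated_total_gb = TOTAL_CHUNKS * DOWNLOADED_GB / DOWNLOADED_CHUNKS

# Download time at one chunk per minute.
hours_spent = DOWNLOADED_CHUNKS * MINUTES_PER_CHUNK / 60
hours_for_all = TOTAL_CHUNKS * MINUTES_PER_CHUNK / 60

print(f"nominal total:      {nominal_total_tb:.2f} TB")
print(f"average chunk:      {avg_chunk_mb:.0f} MB")
print(f"extrapolated total: {extrapolated_total_gb:.0f} GB")
print(f"time for half:      {hours_spent:.1f} h")
print(f"time for all:       {hours_for_all:.1f} h")
```

At exactly one minute per chunk, the 2350 chunks would take a little over 39 hours, so the measured 38 hours means the real average was slightly below one minute per chunk.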
Conclusion
Note: It is really sad that arxiv only accepts pdftex files. As a result, you tend not to write lualatex files in research.
One could make the cynical comment that in the past, we used to just download files. These days one needs a mobile phone and a specific credit card to make this work. AWS is a huge, complex beast and it took me about 4 hours in total to make this work. This motivated me to document the process. However, I understand that making the requester pay for the service makes sense and that this has inherent complexity.
So, now I have some corpus to analyze Teχ files syntactically 👍