Category Theory
Zulip Server
Archive

You're reading the public-facing archive of the Category Theory Zulip server.
To join the server you need an invite. Anybody can get an invite by contacting Matteo Capucci at name dot surname at gmail dot com.
For all things related to this archive refer to the same person.

Stream: community: discussion

Topic: backing up the arXiv

John Baez (Mar 13 2025 at 23:51):

It seems unlikely that the US government would demand that Cornell destroy the arXiv. But the many copies of the arXiv around the world were shut down on September 15, 2024 and it seems that now all the data resides in the US.

If anyone wants to back up the arXiv, they should go here:

arXiv Bulk Data Access - Amazon S3.

The complete arXiv was 2.7 terabytes in March 2023, and growing at about 100 gigabytes a month.
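As a rough sketch of what those two figures imply, here is a linear extrapolation of the archive size. It assumes the growth rate stays constant at the quoted 100 gigabytes a month and uses decimal units (1 TB = 1000 GB). The function name and the choice of March 1, 2023 as the baseline day are illustrative, not anything arXiv publishes.

```python
from datetime import date


def projected_size_tb(
    on: date,
    base_tb: float = 2.7,              # size quoted for March 2023
    base: date = date(2023, 3, 1),     # assumed baseline day within that month
    growth_tb_per_month: float = 0.1,  # quoted growth: ~100 GB per month
) -> float:
    """Linearly extrapolate the archive size in terabytes to the month of `on`."""
    months = (on.year - base.year) * 12 + (on.month - base.month)
    return base_tb + growth_tb_per_month * months


# Two years after the baseline: 2.7 TB + 24 * 0.1 TB
print(round(projected_size_tb(date(2025, 3, 1)), 1))  # 5.1
```

Under these assumptions a full copy made in March 2025 would be roughly 5 terabytes, so anyone planning a backup should budget storage well beyond the 2023 figure.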

Madeleine Birchfield (Mar 15 2025 at 05:33):

While it may seem unlikely that the US government would demand that Cornell destroy the arXiv, enough cuts to Cornell's funding may leave Cornell unable to afford maintaining the arXiv.

Ryan Wisnesky (Mar 15 2025 at 17:39):

The internet seems to be saying that it costs about $100 in AWS fees to download the arXiv. It's a requester-pays model: the user pays for the download bandwidth.
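For anyone who wants to check such an estimate, the arithmetic is just size times a per-gigabyte transfer rate. The sketch below is a minimal version of that calculation. The rate of $0.09 per GB in the example is an assumption for illustration only, not a price quoted in this thread, and actual AWS charges depend on the region, the destination, and the current price list.

```python
def transfer_cost_usd(size_tb: float, usd_per_gb: float) -> float:
    """Estimate the bandwidth cost of downloading `size_tb` terabytes.

    Uses decimal units (1 TB = 1000 GB). `usd_per_gb` is whatever rate
    the downloader expects to be billed; it is an input, not a known price.
    """
    return size_tb * 1000 * usd_per_gb


# Example: the 2.7 TB figure from March 2023 at an ASSUMED $0.09 per GB.
estimate = transfer_cost_usd(2.7, 0.09)
print(f"${estimate:.0f}")
```

Because the result scales linearly with both inputs, different assumptions about the archive's size and the applicable rate can easily produce estimates that differ by several times.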

John Baez (Mar 15 2025 at 20:07):

When I raised this issue on Mastodon, someone did a little calculation estimating $720 for that, but there is also a cheaper way, it seems.

John Baez (Mar 18 2025 at 18:27):

Some people outside the US have offered to set up a mirror of the arXiv, but when the system of mirror websites was taken down on Sept. 15th, the arXiv switched to using the cloud (Amazon) in a way that might make it harder to restart the mirror system. I'm not sure.

Morgan Rogers (he/him) (Mar 19 2025 at 09:01):

:grimacing: another community resource entrusted to Amazon? Eek...

John Baez (Mar 19 2025 at 18:16):

Yes. And the arXiv also has policies that make it difficult to legally copy it and distribute the articles! They write:

Note: Most articles submitted to arXiv are submitted with the default arXiv license, which grants arXiv a perpetual, non-exclusive license to distribute the article, but does not assign copyright to arXiv, nor grant arXiv the right to grant any specific rights to others. We are thus unable to grant others the right to distribute arXiv articles. If you build indexes or tools based on the full-text, you must link back to arXiv for downloads. A small fraction of submissions are made with other licenses and this information is available in the OAI-PMH metadata.

(Emphasis mine.)

John Baez (Mar 19 2025 at 18:22):

This makes it easier for bad actors to knock out the arXiv at a single location. If this happened, the good guys might have to download the arXiv papers from the Internet Archive and build a new interface.

I don't know if Anna's Archive has copied the arXiv. I'm becoming more and more sympathetic to this pirate organization.

Kevin Carlson (Mar 19 2025 at 18:22):

That's so disappointing, since it seems like probably an unintended consequence.

John Baez (Mar 19 2025 at 18:29):

Yes, I think that decision was based on optimistic assumptions like 1) the arXiv administration would never become corrupt and abuse their monopoly power, and 2) the US would not be taken over by science deniers who want to defund academia. It turns out 2) happened before 1) even got a chance to happen! Of course this doesn't imply the arXiv is doomed, but (like many things) it's looking more fragile these days.

Evan Patterson (Mar 19 2025 at 18:48):

I don't know the actual history, but I assume that the default arXiv license is what it is for a clear reason, namely that the academic publishers traditionally take ownership of the copyright of the articles they publish and then enforce strict rules about copying and redistributing the article. arXiv has given itself the minimal rights needed to host and distribute the articles on their server. If the standard license gave arXiv more rights, then the academic publishers might object, which would in turn make researchers less likely to put their articles on arXiv since, after all, career advancement depends on publishing in journals not on preprint servers.

Evan Patterson (Mar 19 2025 at 18:49):

So, as usual, all the blame for why academic publishing is a trainwreck lies with the commercial publishing houses :)

Nathanael Arkor (Mar 19 2025 at 19:42):

For what it's worth, it's typically possible not to give away copyright to publishers, but they're often not very clear about this, and strongly encourage you to transfer copyright.

John Baez (Mar 19 2025 at 20:05):

The arXiv does allow authors to choose a Creative Commons license that makes it easy for people to use their works. But it does not "push" this option... and yes, if they did there would be howls of protest from the publishers.

Ryan Wisnesky (Mar 21 2025 at 02:49):

if anyone does end up downloading it and is willing to copy it to SSDs sent to them, I'd be happy to keep a copy

Eric M Downes (Mar 25 2025 at 08:23):

General point: May I recommend Hetzner as a pretty cheap N EU alternative for those concerned with Amazon. Even if you aren't storing the data there but only running the download, they have S3-compatible storage, so it will behave like AWS in some sense.
https://www.hetzner.com/storage/object-storage/

But if one were to do this, which of course is illegal and I am only saying this for entertainment purposes, then distributed uploading to libgen (which together with Z-Library backs Anna's Archive) is probably more effective than storing SSDs in a fallout shelter (but I mean, do what you feel called to do!)
https://wiki.mhut.org/content:how_to_upload
It appears libgen does not have a complete arXiv upload (I don't know how to see what journals Z-Library has)
https://www.libgen.is/scimag/journals/48611

If one were to do such a thing, then using Tor and BitTorrent is recommended.

If you found this entertaining and would like to engage further in mutual entertainment, you can DM me. :)

Eric M Downes (Mar 25 2025 at 09:32):

Oh and look someone has even started on this: https://github.com/sooper-seekrit/arxiv-to-libgen

deficient minds think alike I suppose!

John Baez (Mar 25 2025 at 14:49):

My gosh! I hope they're actually going to do it.

By the way, the Azimuth Backup Project used Hetzner to temporarily store US government climate data back in the first Trump administration. Then we moved the data to a permanent location.