The trouble with arXiv
We’re now publishing papers at a steady rate at the Open Journal of Astrophysics. This is probably not obvious to outsiders, but our platform actually consists of two different sites, one handling submissions and the other dealing with publishing the papers that are accepted. Although we have a large (and still expanding) team of volunteer Editors to deal with the former, as Managing Editor I am the only person with the keys to the publishing side of things. This part of the process has been simplified enormously by the automation introduced earlier this year, but it still takes some time, as I have to check the overlay and metadata before pressing the button to deposit everything with Crossref and make the overlay live. I also announce each paper on social media. All this usually takes around 15 minutes per paper.
Now that I’ve returned to full teaching duties at Maynooth University, I’ve developed a routine to deal with this activity. On workdays I usually wake around 7am, make some coffee, and then check the day’s arXiv mailing to see if any of our accepted papers have been announced. If any have, I do the honours while I have my coffee, and then proceed to shower and breakfast (including Coffee no. 2); if none have, I go straight to shower and breakfast. I’ve been following this routine for quite a while now.
In the last couple of weeks, however, I have quite often found, when trying to look up newly announced papers on arXiv, that the connection times out with a message saying ‘rate exceeded’. If that happens I just wait a while and try again. It’s not a very serious issue but it does slow down the process.
Well, today I found out the reason via a message on Mastodon. The loading errors at arXiv are caused by people doing many simultaneous downloads in attempts to scrape all the content from arXiv as soon as it is announced. This is almost certainly to provide material for Large Language Models, such as ChatGPT, which are essentially Automated Plagiarism Engines. I propose the acronym APE for the kind of person who engages in this sort of activity.
This is a very tedious development and I hope arXiv can find a way of putting a stop to it without inconveniencing its authentic users. I suggest that the people managing arXiv identify the culprits and send the boys round.
October 22, 2024 at 10:51 am
I used to use Netscape and detractors called it Netscrape…
April 24, 2025 at 8:49 pm
[…] It seems that arXiv is going to be moved from local infrastructure at Cornell University to some sort of Google Cloud Platform. I’m not sure what to make of this move. For one thing, I’m deeply suspicious of Google so I hope that measures will be taken to ensure that arXiv remains freely accessible to the global scientific community. I suspect too that Google will use arXiv submissions as it uses everything placed in its control, to train AI. On the other hand, everything on arXiv is currently publicly available anyway, and there has been evidence of attempts by bots to scrape its content already, causing a (temporary) degradation of service. […]