Archive for Large Language Models

Meta Theft

Posted in Art, Books, Television on March 21, 2025 by telescoper

Beware, all thieves and imitators of other people’s labour and talents, of laying your audacious hands upon our work.

Albrecht Dürer, 1511

I’ve remembered that quotation since it was uttered by Inspector Morse in the episode Who Killed Harry Field? Albrecht Dürer wasn’t referring to Artificial Intelligence when he said it, but it does seem pertinent to what’s going on today.

There’s an article in The Atlantic about a huge database of pirated work called LibGen that has been used by Mark Zuckerberg’s corporation Meta to train its artificial intelligence system. Instead of acquiring such materials from publishers – or, Heaven forbid, authors! – the company decided simply to steal them. That’s theft on a grand scale: 7.5 million books and 81 million research papers.

The piece provides a link to LibGen so you can search for your own work there. I searched it yesterday and found 137 works by “Peter Coles”. Not all of them are by me, as there are other authors with the same name, but all my books are there, as well as numerous research articles, reviews and other pieces.

I suppose many think I should be flattered that my works are deemed to be of sufficiently high quality to be used to train a large language model, but I’m afraid I don’t see it that way at all. I think, at least for the books, this is simply theft. I understand that there may be a class action in the USA against Meta for this larceny, which I hope succeeds.

I think I should make a few points about copyright and authorship. I am a firm advocate of open access to the scientific literature, so I don’t think research articles should be under copyright. Meta can access them along with everyone else on the planet, and it is not really piracy to use something that is freely available anyway. Although it would be courteous of Meta to acknowledge its sources, lack of courtesy is not the worst of Meta’s misconduct.

In a similar vein, when I started writing this blog back in 2008 I did wonder about copyright. Over the years, quite a lot of my ramblings here have been lifted by journalists, etc. Again, a bit of courtesy would have been nice. I did make the decision, however, not to bother about this as (a) it would be too much hassle to chase down every plagiarist and (b) I don’t make money from this site anyway. As far as I’m concerned, as soon as I put anything on here it is in the public domain. I haven’t changed that opinion with the advent of ChatGPT etc. Indeed, I am pretty sure that all 7000+ articles from this blog were systematically scraped last year.

Books are, however, in a different category. I have never made a living from writing books, but it is dangerous to the livelihood of those that do to have their work systematically stolen in this way.

The Big Four and Your Work

Posted in Open Access on September 10, 2024 by telescoper

In Agatha Christie’s novel The Big Four the great detective Hercule Poirot tries to identify the members of a sinister group of unscrupulous individuals bent on world domination.

When it comes to the world of academic publishing, the members of the Big Four are somewhat easier to identify, though no less unscrupulous. They are Elsevier, Springer Nature, Taylor & Francis, and John Wiley & Sons, which together have cornered almost 50% of the lucrative market in scholarly books and journals and the eye-watering profits that go with that territory.

Recently, however, these companies have found a new way of boosting their profits still further. This involves selling their “content” to tech companies in order to train the generative AI algorithms known as Large Language Models. The latest to do this is Wiley, which has already cashed in to the tune of $44 million. Wiley has not given its authors the right to opt out of this deal nor will authors be remunerated. Others outside the Big Four are also cashing in. Oxford University Press, for example, which publishes Monthly Notices of the Royal Astronomical Society, has done similar deals.

This sort of arrangement provides yet another reason to avoid the commercial publishing sector. Do we become academic researchers in order to be mere “content creators” for Wiley and the rest?

On Papers Written Using Large Language Models

Posted in Uncategorized on March 26, 2024 by telescoper

There’s an interesting preprint on arXiv by Andrew Gray, entitled “ChatGPT ‘contamination’: estimating the prevalence of LLMs in the scholarly literature”, that tries to estimate how many research articles have been written with the help of Large Language Models (LLMs) such as ChatGPT. The abstract of the paper is:

The use of ChatGPT and similar Large Language Model (LLM) tools in scholarly communication and academic publishing has been widely discussed since they became easily accessible to a general audience in late 2022. This study uses keywords known to be disproportionately present in LLM-generated text to provide an overall estimate for the prevalence of LLM-assisted writing in the scholarly literature. For the publishing year 2023, it is found that several of those keywords show a distinctive and disproportionate increase in their prevalence, individually and in combination. It is estimated that at least 60,000 papers (slightly over 1% of all articles) were LLM-assisted, though this number could be extended and refined by analysis of other characteristics of the papers or by identification of further indicative keywords.

Andrew Gray, arXiv:2403.16887

The method employed to make the estimate involves identifying certain words that LLMs seem to love, and whose usage has increased substantially since last year. For example, twice as many papers now call something “intricate” as did in the past; there are also increases in the use of the words “commendable” and “meticulous”.
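For the curious, here is a rough sketch in Python of the general idea of a keyword-prevalence estimate. It is not the code or the exact procedure used in the preprint: the function names, the three-word keyword list and the toy abstracts are all made up for illustration, and the "excess" is simply the number of target-year texts above what the baseline-year rate would predict.

```python
import re
from collections import defaultdict

# Marker words mentioned in the post; this short list is an assumption
# for illustration, not the list used in the preprint.
KEYWORDS = ("intricate", "commendable", "meticulous")


def keyword_prevalence(records, keywords=KEYWORDS):
    """Return {year: fraction of texts containing at least one keyword}."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, keywords)) + r")\b", re.IGNORECASE
    )
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, text in records:
        totals[year] += 1
        if pattern.search(text):
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in totals}


def excess_documents(records, baseline_year, target_year, keywords=KEYWORDS):
    """Count target-year texts above what the baseline keyword rate predicts."""
    prevalence = keyword_prevalence(records, keywords)
    n_target = sum(1 for year, _ in records if year == target_year)
    observed = prevalence[target_year] * n_target
    expected = prevalence[baseline_year] * n_target
    return observed - expected


# Invented toy abstracts, purely for illustration (not real data).
toy_records = [
    (2022, "We measure the power spectrum of galaxy clustering."),
    (2022, "An intricate model of halo formation is presented."),
    (2022, "We present new observations of a nearby quasar."),
    (2022, "A simple estimator for cosmic shear is derived."),
    (2023, "This meticulous study explores the intricate dynamics of halos."),
    (2023, "We present a commendable framework for shear estimation."),
    (2023, "We measure the bispectrum of galaxy clustering."),
    (2023, "New observations of a nearby quasar are reported."),
]

prevalence = keyword_prevalence(toy_records)
excess = excess_documents(toy_records, baseline_year=2022, target_year=2023)
print(prevalence, excess)
```

With these toy abstracts the keyword rate doubles from one year to the next, which the sketch reads as one "excess" text out of four. A real estimate would need a large bibliographic database and some care over words that were becoming fashionable anyway.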

I found this a commendable paper, which is both meticulous and intricate. I encourage you to read it.

P.S. I did not use ChatGPT to write this blog post.