Picture, if you will, a robot sitting quietly at a table in a library, reading intently. Imagine that it is the Library of Congress, and the robot is methodically reading—and thereby learning from—all 175 million of its catalogued items. Should this be illegal? Must the robot be stopped?
That remains one of the fundamental unanswered social questions about LLMs. Historically, we have accepted that the moment a work is published, anyone can read and learn from it. We control the copying and reproduction of works—hence ‘copyright’—but not reading them or absorbing their contents. I may not be particularly happy if tankies or neo-Nazis read my novels, or this Substack, but I readily concede that they have every right to do so, and if they’re reading a library copy, they owe me nothing.
There is a great and furious desire among writers, and especially artists, to change this legal regime so as to exclude LLMs. In fact, many would claim that it is obviously immoral theft for a neural network to absorb previously published words and images, and use this absorption to better create new works.
It’s difficult to convey just how unobvious this seems to me. We have never before attempted to legally restrict the reading, rather than the copying, of published work, and it strikes me as not just practically but morally wrong to start doing so. It’s certainly not obvious that it’s some kind of moral imperative to ban our hypothetical library robot even if, or just because, it is owned by an evil tech company.
Of course assembling the library in the first place is a different matter. Libraries and Creative Commons licenses are noncommercial; if OpenAI wants to train their LLMs on my novel Exadelic, they are both legally and morally obligated to buy a copy. But that ends their legal obligation to me … and that obligation is not really what upsets anyone. If OpenAI wants to buy one copy each of 10 million copyrighted books, adding up to a trillion tokens, they will have to spend maybe $200 million—a tiny fraction of their funding and revenue—and authors will in turn receive a couple bucks per book. Nobody on either side cares that much about building the library.
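(If you want to sanity-check that claim, here is the back-of-the-envelope version; the ~$20 price per book and the ~10% author royalty are my illustrative assumptions, not figures from any actual purchase or negotiation.)

```python
# Back-of-the-envelope cost of legally "building the library".
# The $20 price and 10% royalty are illustrative assumptions.
books = 10_000_000
tokens_per_book = 100_000            # roughly a 300-page book
price_per_book = 20                  # assumed average purchase price, USD
royalty_rate = 0.10                  # assumed author's cut of that price

total_tokens = books * tokens_per_book       # ~1 trillion tokens
purchase_cost = books * price_per_book       # ~$200 million
author_gets = price_per_book * royalty_rate  # ~$2 to the author per copy

print(f"{total_tokens:,} tokens for ${purchase_cost:,}; ~${author_gets:.2f} per author per book")
```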
It’s OpenAI’s projected $11 billion in revenue next year, and more yet in the years to come, that so infuriates writers and artists who feel that this revenue is predicated on, and exploits, their previous work. (Although it seems worth noting that many of the same people irate about LLM training data being used by giant corporations were until fairly recently also irate that copyright law was absurdly extended in the 1990s at the behest of, you guessed it, giant corporations.)
“Predicated on” is perhaps true, with the huge caveat that each author’s contribution is truly minuscule. Llama 3 was trained on 15 trillion tokens; a 300-page book contains maybe 100,000 tokens; thus a single book would be a hundred-and-fifty-millionth of Llama’s training dataset, literally a droplet in a swimming pool. If we constructed some sort of ASCAP-like arrangement in which creators were assigned 10% of OpenAI’s revenue as royalties, then their $11 billion in revenue would turn into $1.1 billion for authors, which, spread across the corpus’s roughly 150 million book-equivalents, would mean … about sixty cents a month per book. Especially prolific authors would be able to go to Subway once a month on OpenAI’s dime. We’re not talking about life-changing wealth here.
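(Here is that same arithmetic spelled out, a quick sketch using the Llama 3 corpus size and the per-book token count above; the 10% pool and the $11 billion revenue figure are the same assumptions as in the text, not real licensing terms.)

```python
# Hypothetical ASCAP-style revenue share, per book, per month.
# All inputs are the assumptions from the text, not real licensing terms.
corpus_tokens = 15_000_000_000_000    # Llama 3 training corpus
tokens_per_book = 100_000             # roughly a 300-page book
annual_revenue = 11_000_000_000       # projected revenue, USD
creator_share = 0.10                  # assumed 10% royalty pool

book_fraction = tokens_per_book / corpus_tokens     # ~6.7e-9, i.e. one 150-millionth
book_equivalents = corpus_tokens / tokens_per_book  # ~150 million "books"
royalty_pool = annual_revenue * creator_share       # $1.1 billion per year
per_book_per_month = royalty_pool / book_equivalents / 12

print(f"one book is {book_fraction:.1e} of the corpus")
print(f"about ${per_book_per_month:.2f} per book, per month")  # ~$0.61
```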
(Obviously LLMs actually violating copyright by generating truly derivative works is a different matter; but, especially for written output, this is extremely rare, genuinely difficult, and ~nobody actually uses them to do so. The fact that it is theoretically possible to use them to violate copyright does not imply their use must be legally restricted, any more than it does for printers and photocopiers.)
I think that kind of revenue sharing would be desirable and socially beneficial. But it also seems fairly clear to me, while admittedly IANAL, that it is well outside the remit of current copyright law, without even addressing what happens with LLMs trained in laxer copyright regimes. Angry claims that all generative AI output is obviously a form of theft which requires immediate redress seem to me little more than delusional magical thinking. We don’t currently restrict reading, so our hypothetical robot, whether it is an open-source robot guided by a socialist collective or one wholly owned by an evil multinational tech corporation, currently has every right to read—and learn from—all the books in our hypothetical library.
Should future laws ban robots from reading copyrighted works, even when they don’t actually violate copyright? A lot of authors seem to think that the answer is yes, which I find deeply sad. Should we share some of the robot’s revenues with the creators whose work it read / observed / absorbed? I think we should. (Although, again, let’s not pretend it would be life-changing.) But for that we’ll need entirely new laws, and it seems useless and counterproductive to pretend otherwise. While we’re at it, maybe we could talk about loosening copyright laws back to, say, life of the author plus 50 years. Information may not “want to be free,” but it should collectively benefit us all. Restricting its flow in new ways, no matter how strong our knee-jerk impulses may be, does the exact opposite.