AI and Copyright: Courts Partially Favor Meta and Anthropic

To take on Anthropic and Meta, authors will have to adjust their angle of attack. Or at least wait a little longer.

Accused of illegally exploiting copyrighted works to train large language models, the two companies saw parts of their practices validated this week by first-instance rulings in the United States.

The ruling concerning Anthropic came down on June 23. It originated in a complaint filed by three authors in the summer of 2024 and covers two major aspects. On one hand, the downloading of millions of pirated e-books from collections such as LibGen (Library Genesis), PiLiMi (Pirate Library Mirror), and Books3 (a portion of The Pile dataset assembled by EleutherAI). On the other, the purchase of physical books that were subsequently digitized.

A “great library” built from physical books… and pirated books

This digitization drive had been launched in early 2024, amid questions about the origin of the e-books the company was already using. Anthropic had hired the former head of partnerships at Google Books. He initially approached two publishing houses to obtain a license specifically for training AI models, but ultimately opted to buy physical books in bulk from distributors.

All of this content went into a “central library.” The works Anthropic drew from it to train its LLMs were transformed in four main ways (the cleaning and tokenization steps are sketched just after this list):

  • Selection from the library, then copying into the training dataset
  • Cleaning (removal of elements such as headers and footers)
  • Tokenization
  • Storage in a “compressed” form within the trained LLMs
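
By way of illustration only, here is a minimal Python sketch of what the two middle steps (cleaning and tokenization) can look like in practice. It does not describe Anthropic’s actual pipeline: the page-furniture patterns are hypothetical, and tiktoken’s GPT-2 encoding merely stands in for whatever tokenizer a given model family uses.

```python
import re

import tiktoken  # open-source BPE tokenizer library from OpenAI

# Hypothetical header/footer patterns; a real pipeline would tune
# these per source and per scan format.
PAGE_FURNITURE = re.compile(
    r"^[ \t]*(?:page \d+|chapter \d+)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

def clean(raw_text: str) -> str:
    """Strip header/footer lines, then collapse leftover blank lines."""
    text = PAGE_FURNITURE.sub("", raw_text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()

def tokenize(text: str) -> list[int]:
    """Map cleaned text to token IDs with a GPT-2-style BPE."""
    enc = tiktoken.get_encoding("gpt2")
    return enc.encode(text)

sample = "Page 12\nIt was the best of times, it was the worst of times."
print(tokenize(clean(sample))[:8])  # the IDs a training run would consume
```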

Whether or not the models could reproduce fragments of the books used to train them, Anthropic exposed them to the public behind various content filters. That was sufficient, according to the judge, to avoid substantial plagiarism; he notes that this aligns with the limited amount of text visible on Google Books.

Anthropic judged within the bounds of fair use

The authors did not sue over the outputs: they focused on the inputs. Anthropic had argued, in particular, that its exploitation of copyrighted works was “reasonably necessary” to train its LLMs.

The judge nonetheless deemed the practice to fall under fair use, especially given its transformative nature: LLMs are trained not to replicate or supplant works, but to “create something different.”

Digitization, the judge said, is also a form of transformative use, though a narrower one. It facilitated the storage of books and the use of search tools, two elements not tied to “creative properties.” Moreover, Anthropic destroyed the physical books after scanning and did not disseminate the digital versions. This slightly tips the balance in favor of fair use.

Downloading pirated books, by contrast, is “inherently illegal.” This holds even when the use is transformative and followed by immediate deletion. Anthropic made matters worse for itself by retaining the material after training: hundreds of researchers had access to it… and did indeed use it.

The authors also pointed to another factor traditionally weighed in fair use assessments: the risk of substitution, that is, dilution of the value of the works in question or of their market potential. The judge found that the forms of competition at issue did not fall within the scope of the United States Copyright Act.

Meta wins… on the outputs

The ruling partially in Meta’s favor came down on June 25.

In the summer of 2023, a few months after the launch of the LLaMA models, a group of authors took the American company to court. Their main grievance resembled the one raised against Anthropic: training LLMs on copyrighted works without consent or fair compensation.

To describe the training dataset for the models in question, Meta cites 85 GB of data in a “books” category, drawn from two sources. On one hand, Project Gutenberg, which collects public-domain works. On the other, Books3. That set, a portion of The Pile, derives from a copy of content from the Bibliotik tracker. In other words, according to the plaintiffs, from a “shadow library” of the same kind as LibGen. It contains around 200,000 books.

Paralleling the Anthropic case, the judge held that using copyrighted works without consent or compensation is “in most cases” illegal. He added that fair use typically does not apply to a copy that significantly reduces the “potential market” of a work.

The plaintiffs had argued precisely such a reduction of the potential market, specifically of their ability to license their works for AI training.

This argument does not hold, the judge found: the plaintiffs failed to prove how current or anticipated outputs would dilute that market. At the same time, he rejected the idea that the LLaMA models could reproduce sufficiently significant excerpts from the books used to train them.

Since this is not a class action, the ruling’s consequences are limited. And it bears repeating that the decision does not make training LLMs on copyrighted works legal: the plaintiffs simply failed to advance the right arguments…

