Copyright and generative AI training under U.S. and European law

Technology advances at an ever faster pace, and laws must adapt to new and constantly changing realities.

All intelligence requires prior learning, and Artificial Intelligence (hereinafter "AI") is no exception to the rule. To this end, massive amounts of works of art belonging to their authors and/or copyright owners are often used for training, which may damage creators' interests, as the U.S. Copyright Alliance warned:

"The widespread unauthorized ingestion of copyrighted works would certainly appear to cause immeasurable harm to creators and copyright owners—both by destroying existing, nascent, and to-be-developed licensing markets and by flooding the market with low-quality substitutional material."

(Copyright Alliance Initial Comments at 51, Report of the Register of Copyrights, Part 3, Pre-Publication Versions, United States Copyright Office, pg. 33)

On the other hand, a complete restriction of AI would threaten innovation and progress, as the venture capital firm a16z argued in its initial comments:

"Imposing the cost of actual or potential copyright liability on the creators of AI models will either kill or significantly hamper their development. The result will be far less competition, far less innovation, and very likely the loss of the United States' position as the leader in global AI development."

(a16z Initial Comments at 8, Report of the Register of Copyrights, Part 3, Pre-Publication Versions, United States Copyright Office, pg. 34)

A balance between both parties' interests must be struck. The aim of this article is to analyse how the United States and the European Union are addressing the uncertainty surrounding the inputs used to train generative AI when the copyright owner has given no express authorisation for training based on copyrighted works.

United States

The prominent case The New York Times Company v. Microsoft Corporation, OpenAI, Inc. et al. (case 1:23-cv-11195) is a clear example. The New York Times Co., as copyright owner, claims exclusive rights in its works (articles, reports, reviews, etc.) against Microsoft Corporation, OpenAI, Inc. et al. for the following acts in connection with some of their generative AI tools, such as ChatGPT:

(i) building training datasets out of unauthorised copies;

(ii) storing, processing and reproducing those training datasets containing millions of unauthorised copies; and

(iii) disseminating generative outputs containing copies and derivatives of the unauthorised copies.

In this scenario, some of the biggest AI developers in the US are invoking fair use as a legal defence, a judge-made doctrine codified in Section 107 of the 1976 Copyright Act that excludes liability for unauthorised use based on an assessment of the following four factors:

(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
(2) the nature of the copyrighted work;
(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
(4) the effect of the use upon the potential market for or value of the copyrighted work.

This raises a question: does this defence really apply to the inputs used to train AI?

Courts are beginning to shed some light on this issue, and two orders have recently been delivered: Bartz et al. v. Anthropic PBC, No. 3:24-cv-05417 (N.D. Cal. 23 June 2025) and Kadrey v. Meta Platforms, Inc., No. 3:23-cv-03417 (N.D. Cal. 25 June 2025).

In Bartz, the Court drew an important distinction between lawfully acquired books and pirated books as inputs. The US District Court for the Northern District of California concluded that training an AI with acquired books and converting them into a digital format is fair use, whereas the use of pirated copies cannot be deemed fair use, since it could displace demand for the authors' books and destroy the publishing market.

The reasoning was different in Kadrey, however, where the Court considered that the plaintiffs failed "to present meaningful evidence on the effect of training LLMs like Llama with their books on the market for those books".

In other words, although the rightsholders' claim was dismissed, the judge indicated that a better-supported case could lead to a different conclusion if authors prove that the outputs of the trained AI are capable of competing with the original works, resulting in market dilution that is only possible because of those inputs, and hence excluding fair use.

For now, what we can confirm is that US courts question the fair use defence mainly because it is doubtful whether AI inputs satisfy the fourth factor, the effect of the use upon the potential market for or value of the copyrighted work, since such use may cause significant harm to copyrighted works and their owners through lost sales, lost licensing opportunities or market dilution.

European Union

Turning to how the EU is assessing this matter, judicial assessment is much more constrained by the law, since EU law provides a closed (numerus clausus) list of exceptions, a system far more rigid than the fair use doctrine.

It must be said that no legal provision yet fully addresses this matter. However, AI developers are relying on the EU Directive on copyright and related rights in the Digital Single Market, which contains a provision that could enable them to use copyrighted works:

1. Member States shall provide for an exception or limitation to the rights provided for in Article 5(a) and Article 7(1) of Directive 96/9/EC, Article 2 of Directive 2001/29/EC, Article 4(1)(a) and (b) of Directive 2009/24/EC and Article 15(1) of this Directive for reproductions and extractions of lawfully accessible works and other subject matter for the purposes of text and data mining.

(Article 4 Directive (EU) 2019/790 on copyright and related rights in the Digital Single Market (hereinafter "DSM"))

Nevertheless, its scope remains unclear. Should original artworks be understood as mere "text and data" under the definitions provided in Article 2 of the DSM? Is the creation of a database of artworks for the purpose of training an AI an infringement? We cannot yet provide any answers, but what is certain is that we are not the only ones with doubts: the national courts have them too.

We are currently awaiting the judgment of the Court of Justice of the European Union on the first request for a preliminary ruling on AI, Case C-250/25, filed on 3 April 2025. The Court is expected to interpret how the text and data mining exception operates where content from news sites has been used to train the generative AI chatbot 'Gemini'. The questions referred include the following:

1. Must Article 15(1) of Directive (EU) 2019/790 of the European Parliament and of the Council of 17 April 2019 [on copyright and related rights in the Digital Single Market and amending Directives 96/9/EC and 2001/29/EC], and Article 3(2) of Directive 2001/29/EC of the European Parliament and of the Council of 22 May 2001 on the harmonisation of certain aspects of copyright and related rights in the information society, be interpreted as meaning that the display, in the responses of an LLM-based chatbot, of a text partially identical to the content of web pages of press publishers, where the length of that text is such that it is already protected under Article 15 of Directive 2019/790, constitutes an instance of communication to the public? If the answer to that question is in the affirmative, does the fact that [the responses in question are] the result of a process in which the chatbot merely predicts the next word on the basis of observed patterns have any relevance?

2. Must Article 15(1) of Directive 2019/790 and Article 2 of Directive 2001/29 be interpreted as meaning that the process of training an LLM-based chatbot constitutes an instance of reproduction, where that LLM is built on the basis of the observation and matching of patterns, making it possible for the model to learn to recognise linguistic patterns?

3. If the answer to the second question referred is in the affirmative, does such reproduction of lawfully accessible works fall within the exception provided for in Article 4 of Directive 2019/790, which ensures free use for the purposes of text and data mining?

4. Must Article 15(1) of Directive 2019/790 and Article 2 of Directive 2001/29 be interpreted as meaning that, where a user gives an LLM-based chatbot an instruction which matches the text contained in a press publication, or which refers to that text, and the chatbot then generates its response based on the instruction given by the user, the fact that, in that response, part or all of the content of a press publication is displayed constitutes an instance of reproduction on the part of the chatbot service provider?

(Summary of the request for a preliminary ruling pursuant to Article 98(1) of the Rules of Procedure of the Court of Justice, Case C-250/25, Like Company v. Google Ireland)

It remains to be seen how this challenging subject will evolve.