French lawmakers propose new copyright law about generative AI

Via the wonderfully named TechnoLlama site

On September 12, several French lawmakers from the Assemblée nationale presented a law proposal to the Presidency with the objective of reforming some provisions of existing copyright law (text of the proposal here). This is a rather interesting development that may prove controversial given some of the provisions in the law, but more on that a bit later.

Firstly, even if the proposed law passes, it is not likely to survive in its present form; it is at the very beginning of a lengthy parliamentary process that could take months. The Presidency will assign the proposal to one of the permanent parliamentary commissions, possibly the commission for cultural affairs (Commission des affaires culturelles et de l’éducation), but at the time of writing it doesn’t appear to have been assigned yet, and it is not on the agenda of any of the eight permanent bodies. The designated commission will review the law, and during this review it may call upon expert evidence. After deliberation, the commission can make amendments or even decide not to forward the law to the plenary. If approved, the law will be examined article by article in the Assembly and subsequently voted upon. If it passes, it will be sent to the Senate for further discussion and voting within its own commission. Should the Senate suggest modifications, the law will be returned to the Assembly for further deliberation until a consensus is reached on the text. Once agreed upon, the law will be forwarded to the President for promulgation. Alternatively, the President may refer it to the Constitutional Council for review.

So with the understanding that this is likely not to be the final text of the law, what does the proposal contain?

Preamble

The preamble outlines the objective of the law, which is to “protect authors and artists of creation and interpretation based on a humanist principle, in legal harmony with the Intellectual Property Code.” So far, so good. However, the preamble’s sole example of AI creation is the following: “For instance, in 2016, a painting titled “The Next Rembrandt” was designed by a computer and produced using a 3D printer, 351 years after the painter’s death.”

I must confess, my heart sank slightly upon reading this. It’s astonishing that a portrait from 7 years ago is still referenced as cutting-edge, especially in the preamble of new legislation. While I did reference “The Next Rembrandt” in a 2017 journal article and a subsequent piece in WIPO Magazine, those publications are now 6 years old. In the realm of AI, that’s an eternity. One would anticipate more contemporary examples. May I also take this opportunity to advocate for ceasing the mention of both “The Next Rembrandt” and “Edmond de Belamy” in discussions about AI?

Article 1

The actual text of the proposal contains only four articles, which act as amendments to the existing law. Article 1 adds a paragraph at the end of Article L131-3 of the Intellectual Property Code, which deals with the transfer of the author’s rights (all translations from DeepL):

“The integration by artificial intelligence software of intellectual works protected by copyright in its system, and a fortiori their exploitation, is subject to the general provisions of this code and therefore to authorisation by the authors or right holders”.

One could argue that this was not needed, as it would be considered implicitly part of the existing rights of the author, but this article reinforces that assumption. So any inclusion of an existing work in an AI system requires the authorisation of the author. It’s not clear how this will interact with the existing text and data mining (TDM) provisions already contained in French law in Art L122-5-3; the proposal does not mention exceptions or limitations, and this is likely to be the subject of some debate. Will the law contradict existing European provisions, or will it treat AI and TDM as separate regimes? We don’t know, but I think this could be the subject of a consultation with the Constitutional Council, and perhaps even a referral to the Court of Justice of the European Union at some point (I do not know if that is possible at this stage).

Article 2

Article 2 modifies Art L321-2, which handles collective management organisations. For those unfamiliar with this term, these are organisations which handle rights on behalf of rightsholders in a collective manner, such as PRS for Music, or ASCAP. From this article it seems like this whole legislation is a land-grab by collective societies, and one that may even make some of the most ardent opponents of AI flinch. The first part of the article reads:

“When the work is created by artificial intelligence without direct human intervention, the only rights holders are the authors or assignees of the works that made it possible to conceive the said artificial work.”

This paragraph is quite astounding, and I believe its implications extend far beyond the drafters’ intentions. Firstly, the term “work” in copyright is potent. In French, “l’œuvre” is reserved for copyrighted works throughout the act. Its inclusion here implicitly admits that AI outputs are copyrighted works, offering them protection. This surpasses even UK authorship provisions, but I digress. The second notable point is the definition of AI works created “without direct human intervention.” This phrasing is peculiar since nearly all works involve some human intervention at some stage. The most challenging aspect of this paragraph is its third element: it assigns ownership of the work (now protected by copyright) to the authors or assignees of the works that enabled the creation of the said artificial work.

This is the primary issue I see with the proposed legislation. How can one determine the author of the works that facilitated the conception of the AI-generated piece? While it might seem straightforward if AI works are viewed as collages or summaries of existing copyrighted works, this is far from the reality. As of now, I’m unaware of any method to extract specific text from ChatGPT or an image from Midjourney and enumerate all the works that contributed to its creation. That’s not how these models operate.

One potential solution is to consider the author of every input used in training as a co-author of the work. This aligns with the notion that every AI output is a derivative of all its inputs. While I find this theory lacking in logical rigour, let’s entertain it for the sake of this proposed law.

Considering ChatGPT and literary works, where most litigation currently occurs, GPT-3 was trained on 499 billion tokens. A token in this model is a sequence of characters frequently found together in the training corpus, often comprising around 4 letters or numbers. For instance, the digits 1234567890 are split into the tokens 123, 45, 678, and 90. Thus, a word isn’t necessarily a token; on average, a word translates into roughly 1.3 tokens. Using a token calculator, one can estimate the token count of a text based on its word count. For instance, this blog post contains 2,507 words, translating to 3,343 tokens. Let’s now calculate my potential ownership stake in every GPT-3 output.
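For anyone who wants to replicate these estimates, here is a minimal sketch using tiktoken, OpenAI’s open-source tokeniser library. The exact split depends on which encoding you choose (newer models use a different vocabulary than GPT-3), so treat the counts as indicative:

```python
# Minimal sketch: counting words vs tokens with tiktoken (OpenAI's
# open-source tokeniser). "cl100k_base" is the encoding used by newer
# OpenAI models; GPT-3 used an earlier vocabulary, so counts will differ
# slightly, but the ~1.3 tokens-per-word ratio holds for ordinary prose.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "This blog post contains a couple of thousand words of legal commentary."
words = len(text.split())
tokens = enc.encode(text)

print(f"{words} words -> {len(tokens)} tokens")
```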

Over the years, I’ve penned a significant amount of content. For this post, I calculated the potential volume of my work in GPT’s training data. This blog contains 836,675 words (not counting this blog post), equivalent to 1,115,567 tokens. My publicly available articles and book chapters total 636,196 words, and my two published books (also freely accessible online) comprise 192,332 words. In total, I’ve potentially written 1,655,205 words online, or 2,206,940 tokens. Given GPT-3’s 499 billion tokens, and assuming all my works were included in its training (which I cannot confirm), my token share in GPT-3 would be around 0.0004%. Thus, every time someone uses ChatGPT version 3, this proposed law would entitle me to a 0.0004% share of whatever remuneration is due. However, GPT-4 was reportedly trained on some 13 trillion tokens, which would reduce my share to a mere 0.00002%.

Perhaps there’s no need to open a bottle of champagne yet.

For comparison, The Wheel of Time series spans 6,533,382 tokens (4,900,036 words), constituting about 0.00005% of GPT-4. A Song of Ice and Fire comprises 2,314,739 tokens (1,736,054 words), or roughly 0.00002%. The Harry Potter series has 1,445,560 tokens (1,084,170 words), or about 0.00001%. In one of the lawsuits against OpenAI, the book at issue was 72,000 words long, or 96,000 tokens, which works out to roughly 0.0000007% of GPT-4’s entirety (I’d appreciate a verification of my calculations).
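Since the figures above invite a sanity check, here is a rough sketch of the arithmetic behind the last two paragraphs. It uses the word counts quoted above, the 4/3 tokens-per-word rule of thumb, and the unconfirmed 13-trillion-token figure for GPT-4, so the output is an estimate rather than a claim about the actual training data:

```python
# Back-of-the-envelope check of the ownership-share percentages in the last
# few paragraphs. Assumptions: ~4/3 tokens per word (the ratio used in this
# post), 499 billion training tokens for GPT-3, and the reported but
# unconfirmed figure of 13 trillion tokens for GPT-4.
GPT3_TOKENS = 499e9
GPT4_TOKENS = 13e12
TOKENS_PER_WORD = 4 / 3

word_counts = {
    "My blog, articles and books combined": 1_655_205,
    "The Wheel of Time": 4_900_036,
    "A Song of Ice and Fire": 1_736_054,
    "Harry Potter": 1_084_170,
    "A single 72,000-word novel": 72_000,
}

for work, words in word_counts.items():
    tokens = words * TOKENS_PER_WORD
    print(f"{work}: {tokens / GPT4_TOKENS:.7%} of GPT-4")

my_tokens = word_counts["My blog, articles and books combined"] * TOKENS_PER_WORD
print(f"My share of GPT-3: {my_tokens / GPT3_TOKENS:.5%}")
```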

These figures only account for raw token numbers. Any ownership calculation would also have to factor in that not all sources influence the final trained model equally. While the sources for the latest versions remain undisclosed, we do know GPT-3’s sources and their respective weights from the GPT-3 paper: Common Crawl (filtered, roughly 410 billion tokens) at 60% of the training mix, WebText2 (about 19 billion tokens) at 22%, Books1 (about 12 billion) at 8%, Books2 (about 55 billion) at 8%, and Wikipedia (about 3 billion) at 3%.

As I mentioned in previous posts, Common Crawl is a large dataset of web-crawled text; WebText2 is built from text found via Reddit posts and discussions; and we all know Wikipedia. Everyone seems to agree that Books1 is a dataset of public domain works collected by Project Gutenberg, and it accounts for 8% of the total training weights. Another 8% comes from Books2; we don’t know what is in it, but everyone seems to assume that this is where the in-copyright books, from Stephen King to Gabriel García Márquez, are to be found. As you can see, relative to their size both Wikipedia and WebText2 carry a disproportionately high weighting, well above that of Books2, so prolific Reddit users could have a much higher claim to copyright ownership of any output than George R R Martin or Scott Turow, just to randomly name a couple of authors. I won’t even try to make the calculations of comparative token weights from each source; perhaps someone with better maths skills than me can help?
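As it happens, the comparative weighting becomes a short calculation once one accepts the dataset sizes and sampling weights reported in the GPT-3 paper (Brown et al., 2020). Here is a rough sketch under that assumption, normalising everything against Books2:

```python
# Rough sketch of the "comparative token weight" question, using the dataset
# sizes and sampling weights reported in the GPT-3 paper (Brown et al., 2020).
# The per-token weight is simply mix_weight / dataset_size, i.e. how heavily
# a single token from each source counts during training relative to Books2.
datasets = {
    # source: (approx. tokens in dataset, share of the training mix)
    "Common Crawl": (410e9, 0.60),
    "WebText2": (19e9, 0.22),
    "Books1": (12e9, 0.08),
    "Books2": (55e9, 0.08),
    "Wikipedia": (3e9, 0.03),
}

books2_per_token = datasets["Books2"][1] / datasets["Books2"][0]

for source, (size, weight) in datasets.items():
    ratio = (weight / size) / books2_per_token
    print(f"{source}: one token weighs about {ratio:.1f}x a Books2 token")
```

On those figures, a WebText2 token (sourced via Reddit) ends up weighted roughly eight times as heavily as a Books2 token, and a Wikipedia token roughly seven times, which is precisely the point about prolific Reddit users made above.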

And that’s before GPT-4, which reportedly has 13 trillion tokens. Who knows what the weights on those are? So can we all agree that allocating a percentage of ownership to everyone whose works were in the inputs would be unworkable?

I won’t go into detail on the rest of Article 2, as it develops the premises set out in the first paragraph, namely that the work has copyright, that the authors who contributed to it can be identified, and that they will register with a collective society to receive royalties. It also presupposes that works generated by AI will produce remuneration, which is another wildly speculative assumption, as a different economy is arising from the use of AI in the creative industries that does not rely on copyright.

Article 3

The first part of the article is probably the least controversial of all, and one that is likely to be included as part of the AI Act anyway. The article amends Art L121-2, which deals with the author’s exclusive right of publication (droit de divulgation). The addition reads:

“If a work has been generated by an artificial intelligence system, it is imperative to include the words: “work generated by AI” and to insert the names of the authors of the works that led to the creation of such a work”.

The main controversy here ties back to Article 2: the provision assumes that AI works have copyright, otherwise why include it in a section reserved for one of the most important exclusive rights of the author? And of course, as mentioned above, how can one list the works that led to the creation of an individual output out of billions of individual works? Perhaps if someone named an artist in a prompt one could argue that this artist is the author, but again, this is not how these systems work.

The other issue is that there is no indication of what would happen if the above requirement is not met. Since it is framed as a condition for exercising the exclusive right of publication, one has to assume that a non-compliant work would not enjoy the benefits of that right? There’s a lack of clarity as to how this would work in practice.

Article 4

This may seem like a throwaway article, but this is the real land-grab by collective societies. This article also adds to the existing Art L121-2 (publication right), and it says:

“Furthermore, in the event that a work of the mind is generated by an artificial intelligence system from works whose origin cannot be determined, a tax intended to enhance the value of creation is introduced for the benefit of the organisation responsible for collective management designated by article L. 131-3 as amended of this code.

“This tax is imposed on the company that operates the artificial intelligence system used to generate the said “artificial work”.

“A Conseil d’Etat decree sets the rate and basis of this tax”.

As you can see from the analysis of Article 2, allocating an author may be either impractical or impossible. So in my view, this is the real intention of the legislation. As it will not be possible to find the author of an AI work (which, remember, has copyright and therefore isn’t in the public domain), the law will place a tax on the company that operates the service. So it’s sort of in the public domain, but it’s taxed, and the tax will be paid by OpenAI, Google, Midjourney, StabilityAI, etc., but also by any open source operator and other AI providers (Hugging Face, etc.). And the tax will be used to fund the collective societies in France… so unless people are willing to join these societies from abroad, they will get nothing, and these bodies will reap the rewards. And remember, the idea behind the tax is predicated on outputs being able to make money, which is not at all guaranteed.

I know what you’re thinking… Great! Let them pay! But what will likely happen is that all of these companies will cease to provide services in France, much in line with what happened in Canada with Canadian news media, and French AI creators will turn to open source models in the wild, with no curator or institution guarding them, and these will not pay any tax. That, or VPNs will become very popular.

So we’re back to square one.

Edited to add: The tax, however, could work better with a few tweaks. Instead of including it in the right of publication, add a levy as a condition of the TDM exception for commercial uses contained in Art L122-5-3, much in line with existing levies on personal copies in physical media. Problem solved.

Concluding

This law is unprecedented globally, and it evokes memories of HADOPI for me. For those unfamiliar or too young to recall, HADOPI was a French law enacted in 2009 that was one of the first to introduce what would later be termed a “three strikes and you’re out” copyright system. If this doesn’t ring a bell, don’t blame yourself; it was a flop. The law established an agency responsible for notifying ISPs when someone used their network to pirate music or videos, the idea being that repeat offenders would eventually lose internet access. While it sounded promising on paper, in practice it was unfeasible: not a single individual faced the penalty, and that part of the law was later scrapped because it had failed to live up to its objectives.

As it stands, the proposed law seems impractical and appears to set up a tax that most creators will never benefit from. However, there are some redeeming aspects. I’d prefer to see a revision of the provision declaring AI works to be copyrighted, though I suspect this might be the first part of the legislation to be discarded.

But, who can predict the future? Perhaps France will embrace s9(3) CDPA. One can only dream.
