[Generative AI dossier] - What regulations should govern the design of generative AI?

Written by Erevan Malroux


10 March 2025


The spectacular popularity of the ChatGPT interface has sparked a series of legal debates, to the point of prompting calls for the adoption of dedicated regulations.

However, various regulations already exist and apply to such systems, even if clarifications are needed.

The purpose of this article is to explore some of the structuring rules (such as copyright and personal data protection) that apply to the design and use of AI systems that generate content such as text or images, independently of the draft European regulation on AI that is currently being drawn up.

Contents:

- ChatGPT: a well-trained smooth talker (1/4)

- How to regulate the design of generative AI (2/4)

- From training to practice: generative AI and its uses (3/4)

- [LINC Exploration] - The work of IAsterix: AI systems put to the test (4/4)

 

"No, it's not too early". This was the answer given by Mira Murati, CTO of OpenAI, to a Time journalist who asked her whether the adoption of state regulation might not slow down innovation in generative AI such as ChatGPT. This response echoes a series of initiatives taken several years ago by a number of actors who have adopted ethical charters for the development of AI.

As is often the case when an industry begins to raise concerns, its actors initially respond by proposing a form of self-regulation. This was notably the case with OpenAI, initially set up as a non-profit organisation committed to openness, which made significant moderation efforts when making its ChatGPT system available (for example, by teaching it to refuse to answer certain questions, even though this filter is imperfect).

More recently, a collective self-regulation initiative has emerged, which is potentially more restrictive for actors. But this does not rule out another significant risk: that of standards being defined by too few private actors. Regardless of the efforts made, defining moderation criteria is never neutral, and necessarily reflects particular conceptions of the world and particular values. To take the case of ChatGPT, used by over a hundred million people worldwide, it was an American company with a few hundred employees that defined the categories of unauthorised questions and answers (OpenAI has also revealed that it put even more effort into moderating its new GPT-4 model than it did into GPT-3.5, on which ChatGPT was initially based).

For these reasons, many specialists, including within OpenAI, are calling for the adoption of regulations dedicated to these AI systems, so that the rules are not defined solely by companies. Even more recently, some have gone as far as calling for a moratorium on the development of such systems.

However, while these debates are necessary, they must not overshadow the existing rules that already apply to these systems. This article deals with copyright and personal data protection law, which are key regulations for the design of these systems. It is not intended, however, to address the question of who would be liable for any breaches of these rules, nor to detail the issues involved in generating content that is prohibited by law, such as disinformation (these uses are dealt with more fully in article 3 of the dossier).

 

Why the legal vacuum is a myth: the example of copyright

 

Strictly speaking, there is no legal vacuum

To speak of a legal vacuum is in fact a misuse of language. Even in cases where there is no specific law or regulation applying to a given situation, general principles always make it possible to deal with it legally, whether to prohibit a practice or to regulate its use. When a practice is not contrary to any specific rule or general legal principle, freedom prevails and it is permitted. A judge hearing a case is obliged to give a ruling, whether or not the law has been adapted, on pain of committing a denial of justice. This also applies to AI systems, particularly in France and Europe, where there are already a number of regulations directly related to the development and use of such systems.

In reality, the issue is not the absence of a legal answer, but rather the doubt that may exist as to what that answer means (it would be more accurate to speak of legal vagueness, particularly in the absence of case law), or the fact that the answer is not deemed satisfactory (for example, when a practice is permitted but society feels that it should be more strictly regulated). In the latter case, it is appropriate to consider changing the law, for example by adopting specific regulations.

Copyright is undoubtedly one of the most telling examples of the absence of a legal vacuum.

Design and use of AI models: the contours of plagiarism

Without going so far as to speak of an artistic blur, many commentators have questioned, since the mass adoption of generative AI systems in recent months, whether their creators have respected intellectual property law, particularly in the case of systems that generate images.

Like most generative AI, these systems are based on the application of deep-learning techniques to colossal quantities of data, most often collected on the Internet. This raises the first legal questions: might the model not have infringed copyright by being built on the basis of protected content? Is there not a risk, when using such systems, of being guilty of plagiarism if the result is too directly 'inspired' by a work used to train the model?

These questions raise major issues. Philosophically, they almost amount to asking what a work is ("can we speak of a work without a work" would make a fitting baccalaureate essay question). Economically, the issue is very concrete, since it raises the question of possible remuneration for works published on the Internet, at a time when the systems that re-use them are becoming increasingly industrialised. Another, societal, issue is the relevance of the current rules in this area, since judges will have to decide the question by applying the legal rules currently in force.

Although no judge has yet ruled on the subject, several legal actions have already been brought against AI system designers in several jurisdictions (in particular against image-generating AIs, which are accused of infringing US or UK copyright). This also applies to systems based on language models (see article 1), since texts can also be considered works protected by copyright (the notion of a work is potentially very broad: beyond poems, novels and other literary works, it can also include posts on a social network, blog articles, websites, or even computer code posted on a forum).

When it comes to using content to create an AI model, European law provides a relatively precise answer. A 2019 directive introduced into EU law an exception for text and data mining (one of the techniques used to design such systems). This exception allows publicly accessible content to be collected without seeking the consent of its authors, including for commercial exploitation, provided that the authors in question have been able to object to this re-use of their works. This right to object, which is still relatively unknown to authors, raises questions about the way it is applied: how can an objection be expressed, how should the designers of AI systems take such objections into account, and will it be just as easy for an independent author as for a more structured organisation to assert it? One possible machine-readable channel for such objections is sketched below.
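By way of illustration only, the minimal Python sketch below shows how a crawler's designer might honour such objections, assuming that sites express their opt-out through robots.txt. This is only one possible convention: the directive requires a "machine-readable" objection but does not prescribe any particular format, and the crawler name "ExampleTDMBot" is a hypothetical example.

```python
# A minimal sketch, assuming robots.txt is used as the machine-readable
# opt-out channel (the 2019 directive does not mandate any particular
# format). "ExampleTDMBot" is a hypothetical crawler name.
import urllib.robotparser
from urllib.parse import urlparse

CRAWLER_NAME = "ExampleTDMBot"  # hypothetical text-and-data-mining bot

def may_mine(page_url: str) -> bool:
    """Return True only if the site's robots.txt allows our crawler."""
    parsed = urlparse(page_url)
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    robots.read()  # fetch and parse the site's robots.txt
    return robots.can_fetch(CRAWLER_NAME, page_url)

if __name__ == "__main__":
    url = "https://example.com/some-article"
    if may_mine(url):
        print("No machine-readable objection found: page may be collected.")
    else:
        print("Opt-out detected: skip this page.")
```

Even with such a mechanism, the practical questions raised above remain: an independent author must know that the mechanism exists and be able to configure it, which is far from guaranteed.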

Despite these practical questions, it seems that European law had already anticipated this case, particularly in comparison with the United States, where the application of the "fair use" exception under American copyright law to the training of models on protected works is the subject of more debate.

This leaves the question of copyright compliance, no longer at the stage of training the model but at the stage of generating images or text with such systems. Infringement could occur if the result were too directly 'inspired' by a protected work, and determining whether copyright has been infringed will require a case-by-case analysis.

Here again, the rules that will apply in the event of a dispute have already been determined, although they are not the same everywhere in the world.

In France, it is necessary to check that the original work is a protected creation (i.e. that it is an original work, regardless of its genre, form of expression, merit or purpose) and that it has been reproduced without the author's permission (including in part or by borrowing its original characteristics). These AI systems are designed to learn from phenomenal quantities of content, which should, at least in theory, reduce the risk of their output being likened to plagiarism. However, this risk is not negligible, particularly depending on the user's prompts (in which case the question of responsibility arises) and on the original elements likely to have been learned by the model.

It should be noted that American law also takes into account the originality and creativity of content in order to protect it by copyright. The US Copyright Office recently ruled on a comic strip produced using image-generating AI, deeming the process insufficiently creative in this case, at least as far as the illustrations generated by the system are concerned, because of the random nature of the generation.

While other regulations are potentially in tension with the design or use of such AI systems (e.g. compliance with the general terms and conditions of websites or the protection of trade secrets, to which could be added liability for parasitism under French unfair-competition law), the issue of copyright is one of the most high-profile and potentially one of the most structuring.

The same applies to the protection of personal data, which is also fully applicable.

Protection of data processed by ChatGPT

Applied concretely to the case of ChatGPT, the challenges of protecting personal data are essentially at two levels:

- The design of the underlying language model, which makes structural use of a large amount of data collected on the Internet;

- The use of the ChatGPT system itself, which processes user data and can re-use it to improve the service or to develop new ones.

The GDPR sets out a number of relevant principles and obligations that apply to these different phases.

 

"Open up to AI" or the sesame of the right of access ?

 

As mentioned above, generative AI models are based on the collection and use of colossal quantities of data (some of which is personal), usually on the Internet. If some of your data is online (on networks such as Twitter or Reddit, for example), it may have been used to train the language model on which ChatGPT is based.

As in the early days of web search engines, your first instinct might be to ask ChatGPT directly, in its interface, whether it "knows" you.

But as it is a statistical model and not a search engine, its response, even if positive, would not be very informative. It would merely reveal the model's ability to predict a coherent sequence of letters, words or phrases to follow the sequence of letters that make up your name. Worse still, as the model is probabilistic, its response might not always be the same, or might even be wrong.

So you won't necessarily learn what it "knows" or what it may have "read" about you, still less on which websites, since ChatGPT does not cite its sources in its replies. The toy sketch below illustrates why a probabilistic generator is such an unreliable directory.
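To make the point concrete, here is a toy Python sketch. It is in no way OpenAI's actual model, and the word list and probabilities are invented for the example; it simply shows why a generator that samples the next word from a probability distribution, rather than looking facts up in a database, can give a different (and possibly false) answer on every run.

```python
# A toy illustration (not OpenAI's actual model): the next word is
# *sampled* from a probability distribution rather than retrieved
# from a record about a real person.
import random

# Made-up distribution over possible next words after the prompt
# "Jane Doe is a". In a real language model, such weights are
# learned from huge text corpora.
next_word_probs = {
    "journalist": 0.4,
    "lawyer": 0.3,
    "researcher": 0.2,
    "singer": 0.1,
}

def sample_next_word() -> str:
    """Draw one word at random, weighted by its probability."""
    words = list(next_word_probs)
    weights = list(next_word_probs.values())
    return random.choices(words, weights=weights, k=1)[0]

for _ in range(3):
    print("Jane Doe is a", sample_next_word())
# Three runs can yield three different professions, none of which
# is necessarily true of any real Jane Doe.
```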

The GDPR applies to the collection and re-use of personal data, even if this data is publicly accessible on the internet. In particular, it stipulates that such data must be processed lawfully, fairly and transparently.

Transparency is fundamental to the protection of personal data, and in particular to the regulation of AI systems. It requires whoever collects data on the Internet, annotates it where appropriate and uses it to train an AI model to provide relevant information on the conditions under which the data is processed, in particular on the sources of the collection. In the same vein, the GDPR provides for a right of access, which enables any individual to find out whether an organisation is processing data about them, to obtain a copy where applicable, and to receive more precise information about how it is processed.

The right of access is an essential component of transparency, and in particular makes it possible to direct or exercise other rights provided for by the GDPR, both with regard to the processing of data that may have been collected online to design the AI model, and with regard to the processing of data provided by each user who uses its conversational interface.

"Have I got a say in how my data is used ?" or exercising other rights under the GDPR

The GDPR does not always require consent from individuals to process their data. Other safeguards exist to ensure the protection of personal data without requiring the consent of each individual. These alternatives also apply to the creation and operation of generative AI systems.

The philosophy of the GDPR is to enable individuals to retain control over their data through a series of rights granted to them. These include the right to object, which allows a person, for legitimate reasons, to object to an organisation processing their data and which can be exercised at any time, as well as the rights to erasure and rectification of data.

During the design phase of the language model, the collection and use of publicly accessible data raise a number of questions about their lawfulness (are the measures and safeguards in place sufficient to dispense with the consent of the data subjects?), and about the possibility for each individual to exercise their rights over the personal data concerning them that may be included in the training database.

With regard to the use phase of ChatGPT, the conversational interface may lead users and the general public to entrust potentially sensitive or confidential data to the system (although the interface itself contains a warning not to provide such data). In this respect, it should be noted that the general terms of use of the free version of ChatGPT provide that OpenAI may re-use user data to improve the system or to develop new services (a mechanism for objecting to such re-use seems to be provided for in this case).

Here again, the GDPR sets out the conditions for re-using data provided by users, through a series of rights granted to individuals, but also through other obligations that the data controller must anticipate.

The DPIA, an assessment that lives up to its name, or the principle of accountability

The GDPR requires all data controllers not only to comply with their obligations, but also to be able to demonstrate their compliance (this is known as the "accountability" principle). Every organisation processing personal data must implement measures of varying degrees of importance depending on the risks to data subjects, by default and right from the design stage of processing.

In this respect, the Data Protection Impact Assessment (or DPIA) is a tool provided for by the GDPR, which helps to think about, but also to demonstrate, the compliance of a processing operation. A DPIA is compulsory for the processing of personal data likely to give rise to a high risk to the rights and freedoms of data subjects (for example, in the case of innovative use of data collected on a large scale). It should be noted that the right to protection of personal data is distinct from the right to privacy, since it is intended to protect other rights and freedoms.

The DPIA must therefore take into account not only the risks to privacy, but also the risks to other rights and freedoms. These include the quality, quantity and relevance of the data (including the risk of unduly collecting sensitive data), data security risks (both for the data contained in the model and for the data transmitted by users of the interface), and the risks of bias that the model could develop, particularly as a result of biases already present in the data used to train it, or of biases in the choice of data selected. A deliberately simplified example of one such check is sketched below.
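The Python sketch below illustrates one elementary check that could feed such an assessment: measuring how a hypothetical "language" label is distributed across a training corpus, since a heavily skewed training set is one classic source of model bias. The data and the attribute are invented for the example, and a real DPIA of course involves far more than counting.

```python
# A deliberately simplified sketch of one check that could feed a
# DPIA: measuring how a hypothetical "language" label is distributed
# across a training corpus. A heavily skewed corpus is one classic
# source of model bias. Real assessments go far beyond this.
from collections import Counter

# Hypothetical metadata for documents in a training set.
training_docs = [
    {"text": "...", "language": "en"},
    {"text": "...", "language": "en"},
    {"text": "...", "language": "fr"},
    {"text": "...", "language": "en"},
]

counts = Counter(doc["language"] for doc in training_docs)
total = sum(counts.values())
for language, n in counts.most_common():
    print(f"{language}: {n / total:.0%} of the corpus")
# A strongly unbalanced distribution would be a risk to document,
# justify or mitigate in the impact assessment.
```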

The DPIA's philosophy and its risk-based approach are also at the heart of the draft European regulation on AI (RIA) currently being drawn up.

The Italian authority's decision of 31 March 2023

 

On 31 March 2023, the Italian data protection authority (the Garante, the CNIL's Italian counterpart) adopted an emergency decision against the US company OpenAI L.L.C., which develops and operates the ChatGPT service. This decision temporarily prohibited OpenAI from processing the data of Italian users.

The Italian authority considered that there was sufficient doubt as to ChatGPT's compliance with the GDPR to issue this emergency measure, particularly with regard to the information provided to users whose data was used to train the model, the legal basis for that processing, the inaccurate nature of certain data, and the absence of any age-verification mechanism on a service presented as being reserved for people aged over 13.

On 12 April 2023, following a series of discussions with representatives of OpenAI, the Italian authority set out certain requirements to be met by 30 April 2023 (relating to transparency, the rights of data subjects and the legal basis for the processing carried out by ChatGPT) in order for its temporary ban to be lifted.

On 13 April 2023, the European Data Protection Board, which brings together the CNIL and all its European counterparts, decided to launch a working group on ChatGPT. Its aim is to promote cooperation and the exchange of information on possible enforcement initiatives by the various supervisory authorities to ensure the application of the GDPR.

On 28 April 2023, the Italian authority listed nine measures taken by OpenAI, enabling access to ChatGPT to be restored for Internet users based in Italy. These measures, which apply throughout the European Union, include in particular: information detailing the processing carried out; the possibility for users and non-users to object to the processing of their data (via a form or by email); the introduction of measures for the deletion of inaccurate data (given that correcting such data appears impossible today); clarification of the way in which user data can be re-used to improve the algorithm, without prejudice to the right to object to this re-use; and the implementation of an age-declaration mechanism (barring access to users under 13, and to minors aged between 13 and 18 without parental consent).

 

Towards a new approach to regulating AI systems?

A new approach to personal data protection?

As we have just seen, data protection regulations are based on very robust principles. The GDPR therefore remains not only applicable, but entirely relevant, to generative AI systems, whose design requires large quantities of data, often personal data. Indeed, a recent study by the Conseil d'État highlights "the very strong interconnection between the [forthcoming] regulation of AI systems and that of data, in particular personal data".

However, while collecting data online or re-using user data to improve services or develop new ones is nothing new, designing generative AI models on such a scale raises new legal and technical issues. For example, when it comes to the statistical models at the heart of generative AI, guaranteeing respect for individuals' rights raises new questions, such as the possibility of retrieving data from the training set, or of exercising one's rights directly against the model. The CNIL has been examining these issues for over a year and has already published a number of resources on the subject, particularly for professionals.

Furthermore, faced with the spectacular technological advances in AI and in response to these societal challenges, the CNIL has recently created an artificial intelligence department. In this context, it will have to analyse how the GDPR can provide a framework for the development of generative AI and its uses in order to ensure balanced regulation and enable organisations, data subjects and the CNIL to control the risks to privacy. Work in this area will be carried out in 2023 to provide initial responses.

Draft regulation on AI to complement the GDPR

The work surrounding the draft regulation on AI could give the impression that the current regulations are obsolete or outdated. However, this draft regulation is not intended to replace the GDPR for these systems, but rather to complement it.

At this stage, the draft RIA spells out a number of principles that are already present, in more general form, in the GDPR. This applies in particular to the principle of transparency and to the consideration of risks upstream of system design. Moreover, the AI regulation is intended to apply primarily to the "providers" of AI solutions wishing to access the European market, whereas the GDPR mainly concerns the users of those solutions, who are responsible for the processing they purchase or implement.

On the other hand, it is more precise in its risk-based approach, in particular by proposing different risk categories for each use, with the measures to be taken being more or less restrictive. In particular, the draft regulation distinguishes between AI systems presenting unacceptable risks (which are prohibited), "high-risk" systems (subject to high requirements) and "limited-risk" systems (subject to minimum requirements, including transparency).

In this respect, the question arises of finding the right level of supervision for so-called general-purpose AI systems, of which generative AI is a sub-category. Should a bespoke regime be created, as was once envisaged (not least because of the difficulty of anticipating, at the design stage, all the high-risk uses of such systems)? Should they be treated as "high-risk" AI systems, which would require significant measures to mitigate the various risks involved, such as greater transparency for users, particular attention to the data fed into the system, and a high level of robustness, security and accuracy?

Beyond AI regulation, in France, the report by the Comité Pilote d'Ethique du Numérique on conversational agents has issued 13 recommendations, 10 design principles and 11 research questions for this type of system. Among these, "Asserting the status of conversational agents", "Reducing the projection of moral qualities onto a conversational agent" and "Technically supervising deadbots" are recommendations whose implementation would undoubtedly help address the issues mentioned above.

The European co-legislators will have to answer these questions. Some commentators wonder whether the new obligations placed on the designers of such devices will be sufficient (particularly with regard to transparency obligations on general-purpose AI systems), while others fear that overly onerous additional obligations could slow down innovation by companies, to the benefit of foreign actors.

In the same way that the GDPR does not apply only to European organisations, the draft RIA would also apply to foreign providers who place their AI systems on the European market or put them into service there.

 



Article written by Erevan Malroux, legal adviser in the Economic Affairs Department