[Dossier] How to prevent data extraction bots from collecting publicly available data

Written by Romain Darous

02 June 2026


The development of general-purpose AI models requires the creation of massive datasets, often built from publicly available data collected online. Website owners whose content is subject to such collection have a range of tools at their disposal to help them control the data being collected and the ways in which it may be used. The articles in this dossier first introduce the concepts related to online data scraping, before presenting the methods that website publishers can implement to manage and control the data collected by data extraction bots.

 

 

 

Introduction

 

When a publisher puts a website online, they do so to reach an audience and provide content (articles, videos, podcasts), sell products, or offer online services. However, users do not know the exact domain name of every website they may wish to visit. They therefore rely on search engines, which suggest web pages based on keywords submitted by the user. To provide relevant results, search engines use automated programs that crawl the Web, analyze websites, and index them. This process benefits both parties: it enables search engines to deliver useful results to users, and it allows publishers to make their content visible to potential readers, customers, or users.

Today, some large language models can disrupt this balance. Their training often relies on massive data collection from online sources, yet the benefits for website owners are far less evident. According to The Wall Street Journal, the rise of conversational agents and of AI-generated summaries integrated into search engine results, which provide access to information without clicking on links, has led to a decline in website traffic. This large-scale online data collection also raises questions about its legality and about the obligations of the entity collecting the data when it includes personal data. In its dedicated practical guidance sheets, the CNIL sets out the obligations of the data controller when developing an AI system that involves the processing of personal data.

Website publishers have technical means to regulate this practice: they can choose whether or not to allow access to their data by implementing mechanisms to block bots, or by explicitly defining access conditions through specific files consulted by the software programs that collect their data.
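As an illustration of the declarative approach, here is a minimal sketch of how a well-behaved bot might consult such a file before fetching a page. It assumes the file in question is a robots.txt file and uses Python's standard `urllib.robotparser` module. The rules, the `example.com` URLs, and the bot name `ExampleSearchBot` are hypothetical, and `GPTBot` appears only as an example of a declared crawler name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that a publisher might serve at /robots.txt.
# "GPTBot" is used only as an example of a declared crawler name.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the declared rules allow this bot to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
print(may_fetch(parser, "GPTBot", "https://example.com/articles/1"))
print(may_fetch(parser, "ExampleSearchBot", "https://example.com/articles/1"))
print(may_fetch(parser, "ExampleSearchBot", "https://example.com/private/x"))
```

Such a file only states the publisher's wishes: nothing in this sketch prevents a bot from skipping the check altogether.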

 

Why regulate web data extraction practices?

 

Risks to the proper functioning of websites, …

The risks associated with data scraping, especially on a large scale, are numerous. When left unchecked, these practices pose a technical threat to website hosting. Web data extraction results in an increase in requests sent to the same server, which can lead to overloads or even server crashes, with consequences similar to distributed denial-of-service (DDoS) attacks. The recent overload of the company Triplegangers’ website caused by OpenAI’s scraping bots is one such example.
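To make the overload risk concrete, the sketch below shows one common way a server can cap the number of requests accepted from a single client over a sliding time window. This technique is not described in this dossier and is given only as an illustration. The class name, the thresholds, and the client identifier (an IP address from a documentation range) are assumptions chosen for the example.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client within `window_seconds`."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # One queue of accepted request timestamps per client identifier.
        self._hits = defaultdict(deque)

    def allow(self, client_id: str, now: float) -> bool:
        hits = self._hits[client_id]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: reject without recording
        hits.append(now)
        return True


# Hypothetical traffic: five requests from the same client, timestamps in seconds.
limiter = SlidingWindowLimiter(max_requests=3, window_seconds=1.0)
decisions = [limiter.allow("203.0.113.7", t) for t in (0.0, 0.1, 0.2, 0.3, 1.05)]
print(decisions)  # [True, True, True, False, True]
```

A limit of this kind protects the server but does not distinguish a scraping bot from a very active human user, so thresholds need to be chosen with care.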

 

… to intellectual property, …

Web data extraction can also infringe intellectual property rights. In a briefing note providing an overview of opt-out mechanisms, the PEReN (a French public body that provides technical expertise to other public authorities regulating digital platforms and AI) highlights copyright-related issues. It cites the example of a sanction imposed on Google in 2024 for training generative AI models on data from websites and news agencies without informing them, and without providing a mechanism allowing them to object to the use of their content by Google’s AI systems without affecting their visibility in the search engine. More broadly, the PEReN discusses the question of value sharing between foundation model developers and the owners of the websites whose publicly accessible data is used for training.

 

… and to the protection of the personal data of data subjects

Furthermore, web data extraction often involves the collection of personal data. In such cases, compliance with the GDPR is required, and it must be ensured that the data processing is lawful. For example, the CNIL recognizes legitimate interest (Article 6(1)(f) of the GDPR) as a valid legal basis for collecting publicly available online data for the purpose of developing AI systems, provided that strong safeguards are implemented. Those safeguards are detailed in a dedicated AI how-to sheet from the CNIL.

 

We therefore propose a series of three articles aimed at:

  • reviewing existing practices for collecting publicly accessible online data;
  • detailing the declarative methods that website publishers can implement to regulate such practices;
  • exploring the measures that can be used to completely block access to a website when the user is identified as a scraping bot.

 

Should measures taken be tailored?

 

Fallible methods, unsuitable for personal data

Combining declarative protocols with blocking methods (presented in the articles below) reduces the risk of unwanted data extraction from a website as far as possible. Declarative protocols are based on existing standards and on initiatives specifically designed for data extraction bots. Nevertheless, nothing guarantees that bots will systematically respect them. Blocking measures provide more general control over malicious traffic on a website, including the detection of scraping bots that do not comply with declarative protocols. While they can, in practice, prevent a detected bot from browsing the website, they may also degrade the user experience, be intrusive, and remain fallible.
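The combination of the two approaches can be sketched as follows: the publisher replays its access log against its own declared rules and flags the bots that ignored them, which then become candidates for blocking. This is a simplified illustration, not a method prescribed here. It assumes robots.txt as the declarative protocol and assumes that each log entry has already been reduced to a bot's product token and a requested path; the bot names and paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical declared rules: one bot is refused everywhere, others are welcome.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""


def find_violations(robots_txt, requests):
    """Return the (user agent token, path) pairs that break the declared rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [(ua, path) for ua, path in requests if not parser.can_fetch(ua, path)]


def build_blocklist(violations):
    """Collect the bots that ignored the rules, as candidates for blocking."""
    return {ua for ua, _ in violations}


# Hypothetical access log entries, already reduced to (product token, path).
access_log = [
    ("ExampleTrainingBot", "/articles/1"),
    ("ExampleSearchBot", "/articles/1"),
    ("ExampleTrainingBot", "/about"),
]

violations = find_violations(ROBOTS_TXT, access_log)
print(violations)
print(build_blocklist(violations))
```

This only catches bots that announce themselves honestly. A bot that disguises its user agent would pass this check, which is one reason such measures remain fallible.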

Moreover, these methods are not specifically designed to ensure the exercise of the right to object under the GDPR when the collection concerns personal data. Since declarative methods are limited to permissions by page and/or by content type, a granular objection mechanism at the level of each individual data subject remains to be developed.

 

Tailoring one’s objection according to the bot’s purpose

Finally, it is worth considering which types of data extraction bots should be blocked, as there are several categories serving different purposes. In its note on objection mechanisms, the PEReN proposes classifying the various bots used by LLMs into four categories: “AI data scrapers,” which collect data for the purpose of training foundation models; “AI search crawlers” and “AI assistants,” both of which browse the internet to provide up-to-date content for users of conversational agents; and “undocumented AI agents,” unidentified scraping bots whose data collection purpose is unknown.

Conversational agents, although prone to hallucinations, are increasingly becoming a major and sometimes exclusive source of information. They now almost systematically include web search functionalities (through “AI search crawlers” or “AI assistants”) in order to provide reliable and up-to-date responses, which credit the websites used to generate them. Users can follow the links provided to verify the sources and take their research further. While users do not necessarily click on these links, they at least have the opportunity to do so. It is therefore legitimate to ask whether such bots should be denied access to a website’s content, as this would prevent the website from being used as a reference in a conversational agent’s response.

This issue is all the more relevant given that the emergence of agentic AI means that foundation models can now perform autonomous actions on a virtual browser at a user’s request. Blocking these bots therefore amounts to blocking the human user behind the request.
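A purpose-based policy of this kind could be sketched as below, using the four category names cited from the PEReN. The mapping from user agent tokens to categories and the allow/deny choices are assumptions made for the example; a publisher would need to check each operator's own documentation and decide on its own policy. The sketch also assumes it is applied only to traffic already identified as coming from an AI bot.

```python
from enum import Enum


class BotCategory(Enum):
    AI_DATA_SCRAPER = "AI data scraper"
    AI_SEARCH_CRAWLER = "AI search crawler"
    AI_ASSISTANT = "AI assistant"
    UNDOCUMENTED_AI_AGENT = "undocumented AI agent"


# Illustrative mapping from user agent tokens to categories (an assumption to
# be verified against each operator's published documentation).
KNOWN_BOTS = {
    "gptbot": BotCategory.AI_DATA_SCRAPER,
    "oai-searchbot": BotCategory.AI_SEARCH_CRAWLER,
    "chatgpt-user": BotCategory.AI_ASSISTANT,
}

# Hypothetical publisher policy: stay visible in AI-assisted search,
# refuse collection for training and collection for unknown purposes.
POLICY = {
    BotCategory.AI_DATA_SCRAPER: False,
    BotCategory.AI_SEARCH_CRAWLER: True,
    BotCategory.AI_ASSISTANT: True,
    BotCategory.UNDOCUMENTED_AI_AGENT: False,
}


def categorize(user_agent_token: str) -> BotCategory:
    """Bots that are not documented fall into the 'undocumented' category."""
    return KNOWN_BOTS.get(user_agent_token.lower(), BotCategory.UNDOCUMENTED_AI_AGENT)


def is_bot_allowed(user_agent_token: str) -> bool:
    return POLICY[categorize(user_agent_token)]


for token in ("GPTBot", "OAI-SearchBot", "ChatGPT-User", "MysteryBot"):
    print(token, categorize(token).value, is_bot_allowed(token))
```

Keeping the policy in a separate table from the bot list makes it easy to change the publisher's position on one category without touching the identification logic.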

 

Alternatives to opting out of web data extraction

A website publisher may therefore wish to make its content accessible to these scraping bots in order to appear among the sources used by conversational agents in the responses they provide to end users. From this objective emerges the concept of SAIO (Search AI Optimization), the equivalent of SEO (Search Engine Optimization) applied to optimizing a website’s visibility and usability for AI bots. It is also worth noting the emergence of proposals such as “LLMs.txt,” a text file published alongside a website’s pages (typically at its root) that contains information easily readable by a conversational agent, thereby giving it direct access to the website’s relevant content. Companies such as Parallel AI or Perplexity specialize in AI-optimized search, illustrating the growing importance of this use case.
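As a rough illustration, the sketch below generates a small file in the spirit of “LLMs.txt”. It assumes the Markdown layout used in the public proposal (a title, a short quoted summary, then sections listing links with a brief note); the exact format should be checked against the proposal itself. The site name, URLs, and notes are hypothetical.

```python
def render_llms_txt(site_name, summary, sections):
    """Build the text of an llms.txt-style file.

    `sections` maps a section heading to a list of (title, url, note) tuples.
    The layout is an assumption based on the public proposal.
    """
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical content for an example website.
content = render_llms_txt(
    "Example Site",
    "Practical guides about houseplants.",
    {
        "Guides": [
            ("Watering basics", "https://example.com/watering.md", "how often to water"),
            ("Light needs", "https://example.com/light.md", "choosing a spot"),
        ],
    },
)
print(content)
```

The publisher would then serve the resulting text as a static file, so that an agent can read a curated summary instead of parsing every page.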

Finally, it is possible to establish contracts with companies that engage in online data collection. The PEReN notably cites partnerships between OpenAI and Le Monde, Google and Associated Press, as well as between Perplexity, Humanoid (Ebra), Mistral AI, and Agence France-Presse. Reddit also provides an API granting access to the platform’s public data in a structured and cleaned format. These partnerships make it possible to generate revenue from scraping activities and contribute to a rebalancing of the benefits derived by both parties from data collection.

 



Illustration: Nano Banana 2


Article written by Romain Darous, engineer in the Artificial Intelligence Department