[Article 2/2] Data-hungry World Models: not without risks
Written by Régis Chatellier
25 March 2026

Data-hungry world models are not without risk. According to the designers of world models, understanding the physical world requires data far richer than text or images alone, and the sources of such data are not always clearly defined. Both the conditions under which this data is collected and the ways in which these models will be used once deployed raise new challenges for individuals’ rights.
World models, like language models and multimodal models, require substantial datasets for their training. For instance, the V-JEPA 2 model required over one million hours of video. As such, they raise questions regarding the types of data collected, their sources, and associated rights. These issues and risks also persist once the models are deployed in production.
What data are needed to train them?
Although non-contrastive self-supervised learning requires less data than contrastive methods, world models still demand a substantial volume of training data, potentially far larger than that needed for language models. This no longer involves merely textual data, or even “flat” 2D images, but increasingly video and all other types of data representing the world.
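To see why non-contrastive approaches are less data-hungry, the sketch below illustrates a JEPA-style objective in simplified PyTorch: the model predicts the embedding of a hidden part of the input from the visible context, so no pool of negative examples is required. This is a minimal illustration under assumed dimensions and placeholder architectures, not Meta’s actual V-JEPA code.

```python
# Minimal sketch of a JEPA-style non-contrastive objective (illustrative only).
# The model predicts the *representation* of a masked region from the visible
# context, so no negative pairs are needed.
import torch
import torch.nn as nn

EMBED_DIM = 128  # placeholder embedding size

context_encoder = nn.Sequential(
    nn.Linear(768, EMBED_DIM), nn.ReLU(), nn.Linear(EMBED_DIM, EMBED_DIM))
target_encoder = nn.Sequential(
    nn.Linear(768, EMBED_DIM), nn.ReLU(), nn.Linear(EMBED_DIM, EMBED_DIM))
predictor = nn.Sequential(
    nn.Linear(EMBED_DIM, EMBED_DIM), nn.ReLU(), nn.Linear(EMBED_DIM, EMBED_DIM))

# In practice the target encoder is an exponential-moving-average copy of the
# context encoder; here it is copied once and frozen to keep the sketch short.
target_encoder.load_state_dict(context_encoder.state_dict())
for p in target_encoder.parameters():
    p.requires_grad = False

def jepa_loss(context_patches: torch.Tensor, target_patches: torch.Tensor) -> torch.Tensor:
    """Predict the embedding of masked target patches from the visible context."""
    pred = predictor(context_encoder(context_patches))  # predicted representation
    with torch.no_grad():
        tgt = target_encoder(target_patches)            # actual representation
    # The loss lives in embedding space, not pixel space, and uses no negatives.
    return nn.functional.mse_loss(pred, tgt)

# Dummy batch: 16 flattened "patches" of 768 features each.
loss = jepa_loss(torch.randn(16, 768), torch.randn(16, 768))
loss.backward()
print(f"JEPA-style loss: {loss.item():.4f}")
```

Because the objective compares embeddings rather than raw samples, no large pool of negative examples is needed, which partly explains the lower data cost relative to contrastive methods; the absolute data volumes involved nonetheless remain very large.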
In his paper, Yann Le Cun specifies the sources and modes of acquisition necessary for training the models:
- Textual data: although the starting point of the article focuses on the limitations of language models, he acknowledges that text remains an important source of high-level knowledge, even if it is insufficient: “A large portion of human ‘common sense’ knowledge is not represented in any text and results from our interaction with the physical world. As LLMs have no direct experience of the underlying reality, the type of common-sense knowledge they display is very superficial and can be disconnected from reality.” This is why he turns to additional types of data.
- Diverse “sensory streams” (video, audio, touch): for training, the models require streams from sensors, such as video (to learn intuitive physics, depth, and object permanence), audio, touch signals, and speech. These are precisely the types of signals that humans (and non-human animals) acquire during the first months and years of life.
These “sensory” data must not be merely static, but dynamic, coming from multiple sources and modes of acquisition:
- Passive observation: through sensor streams such as video or audio;
- Active foveation: directing gaze, attention, or sensor orientation without affecting the environment;
- Observation of another agent: watching another agent act on the environment to infer the causal effects of its actions;
- Active egomotion: moving a sensor or camera relative to a real or virtual environment without significantly altering it;
- Active agency: learning to predict the consequences of one’s own actions by directly influencing the sensory streams.
Thus, much of the learning would occur through observation of the real world, much like humans and animals, which acquire vast amounts of foundational knowledge about how the world works with very few direct interactions, particularly in the early stages of development. This raises the question of the sources of the data needed to feed and train these world models.
In the paper, as well as during a presentation at the AI-Pulse 2025 event held in Paris in November 2025, Yann Le Cun specified the types of sources to be used for training. Platforms such as YouTube provide enormous quantities of video data, ideal for self-supervised learning. According to him, these videos have the advantages of being “easy to obtain” and “public” (sic), allowing models to be trained on much larger volumes of data: continuous, noisy, high-density signals that are richer than textual data, and redundant. For example, Meta used the equivalent of 100 years of video to train the V-JEPA video model (see below).
In addition to these sources, the models require training on data that link perceptions to actions. These can come from several sources:
- Video games: interaction data that allow the simulation of environments where actions can be taken and their consequences observed;
- Robotics: data from robot simulations;
- Real-world data: capturing physical interactions, including vision, touch, and proprioception.
At this stage, little information is provided regarding sensory data. Smart glasses are seen as a potential source of data, as these devices can record everyday tasks from a human point of view (“first-person view”), providing perception data that closely approximates human experience. As for touch-related data, there is no information on the methods for collecting it or on the types of data involved, even though such data would be personal and potentially sensitive.
What are the challenges in terms of rights and data protection?
As of February 2026, while numerous projects fall within the wave of world models, most remain, at this stage, “classical” generative models. Autonomous intelligence models are still largely hypothetical, and it is unclear if, or when, they will truly reach the market and fulfil their promises. Researchers and AI experts quoted by Libération express scepticism about the project, at least in the short term.
The training and deployment of world models nevertheless raise significant questions regarding data protection, the AI systems themselves, and, more broadly, ethical considerations.
Data protection and individual rights
All world models, whether video generation models, video game models, or autonomous intelligence models, rely on learning from vast amounts of data. In this respect, the specificity of world models lies in the sheer scale of the data used for training, which is significantly greater than that needed for “classical” language models.
The risks described by the CNIL in its guidance on the collection of online data via web scraping also apply to the “video” data used to train these models.
The guidance notes that the use of these tools carries risks to privacy and the rights guaranteed under the GDPR, which can have significant impacts on individuals. This is particularly due to “the large volume of data collected, the high number of individuals concerned, difficulties in exercising the right to erasure, the risk of collecting personal data (e.g., from social media), or even sensitive or highly personal data, in the absence of sufficient safeguards. These risks are all the greater because they may also involve data concerning vulnerable individuals, such as minors, who require special attention and must be informed in an appropriately tailored manner.”
There is also a risk of illegal data collection, insofar as “certain data may be protected by specific rights, notably intellectual property rights, or their reuse may be conditional on the consent of the individuals concerned.” More broadly, training these models carries a risk to freedom of expression, since “undifferentiated and massive data collection, and their absorption into AI systems capable of reproducing them, can affect the freedom of expression of the individuals involved (creating a sense of surveillance that may lead users to self-censor, particularly given the difficulty of excluding published data from scraping practices), even though the use of certain platforms and communication tools is necessary on a daily basis.”
Risks specific to “sensory” data
World models, particularly those aligned with the vision proposed by Yann Le Cun, require training on new types of data, which go beyond text or publicly shared videos to what he terms “sensory” data. While he does not provide details on how touch-related data might be collected, he highlights the use of smart glasses for data combining image and movement in a “first-person view” format. This comes at a time when the smart glasses market is rapidly expanding, driven notably by Meta with its Ray-Ban glasses, Google, and a multitude of other players, particularly from the United States and China. In January, Le Figaro ran a headline on the spectacular revival of smart glasses at CES in Las Vegas.
By definition, sensory data are linked to a physical person, whether the wearer of the smart glasses or an individual captured by any other type of sensor. It is therefore essential to exercise caution regarding the sources of these data and the conditions under which they are collected and processed.
The maximalist approach proposed for world models complicates the implementation of the data minimisation principle in AI system development. While it is not prohibited to train an algorithm on very large volumes of data, the principle of minimisation requires developers to carefully consider, prior to training, which personal data are truly necessary for system development. In the case of so-called sensory data, their status under the GDPR must also be addressed: “sensory” data could fall into the category of sensitive or “highly personal” data, potentially giving rise to high-risk situations.
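As one concrete illustration of what minimisation can mean for video training data, the sketch below anonymises frames before they enter a training set. The `detect_faces` helper is a hypothetical stand-in for any real face-detection library, and the whole snippet is an assumption-laden illustration of the principle, not a prescribed compliance measure.

```python
# Illustrative sketch: anonymise video frames before training, keeping only
# what is necessary for the learning task (scene dynamics, not identities).
import numpy as np

def detect_faces(frame: np.ndarray) -> list[tuple[int, int, int, int]]:
    """Hypothetical stand-in for a real face detector; returns (y0, y1, x0, x1) boxes."""
    return [(10, 40, 10, 40)]  # fixed fake detection, for the demo only

def pixelate(region: np.ndarray, block: int = 8) -> np.ndarray:
    """Coarsen a region so individuals are no longer identifiable."""
    h, w = region.shape[:2]
    small = region[::block, ::block]
    return np.repeat(np.repeat(small, block, axis=0), block, axis=1)[:h, :w]

def minimise_frame(frame: np.ndarray) -> np.ndarray:
    """Return a copy of the frame with all detected faces pixelated."""
    out = frame.copy()
    for y0, y1, x0, x1 in detect_faces(frame):
        out[y0:y1, x0:x1] = pixelate(out[y0:y1, x0:x1])
    return out

frame = np.random.default_rng(1).integers(0, 256, (64, 64), dtype=np.uint8)
print(minimise_frame(frame).shape)  # anonymised frame, same shape as the input
```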
These new world models, in the way they are presented, reveal an extension of data collection and of the associated issues that have been discussed in recent years, including matters relating to copyright, the quality of the data collected (and the biases they may embed), and individuals’ rights.
Risks associated with world generation and predictions
The ability to generate hyper-realistic text, images, and videos for the spread of false information or disinformation is already one of the major challenges facing our societies and democracies. The capabilities offered by generative models, particularly for audio and video, tend to amplify this phenomenon and pose a regulatory challenge. Detecting AI-generated content has become a major research priority to address these risks: “it is becoming increasingly difficult to resolve due to advances in generative AI, and it will be even more challenging with the advent of world models capable of producing coherent, multidimensional outputs” (Ding, Zang, et al.). In 2023, the LINC published an article on digital watermarking of AI-generated content as a transparency measure.
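As a simplified illustration of the watermarking idea mentioned above, the sketch below hides a provenance tag in the least significant bits of an image. This is only a toy example: LSB marks do not survive compression or editing, and real provenance schemes for AI-generated content (statistical watermarks applied at sampling time, or signed metadata in the C2PA style) are considerably more robust.

```python
# Toy sketch of invisible watermarking via least-significant-bit (LSB)
# embedding, to illustrate the principle of hiding a provenance tag in output.
import numpy as np

def embed_watermark(image: np.ndarray, tag: str) -> np.ndarray:
    """Write the bits of `tag` into the LSBs of the flattened image."""
    bits = np.unpackbits(np.frombuffer(tag.encode(), dtype=np.uint8))
    marked = image.copy().reshape(-1)
    marked[:bits.size] = (marked[:bits.size] & 0xFE) | bits  # overwrite LSBs
    return marked.reshape(image.shape)

def read_watermark(image: np.ndarray, length: int) -> str:
    """Recover `length` characters from the LSBs."""
    bits = image.reshape(-1)[: length * 8] & 1
    return np.packbits(bits).tobytes().decode()

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
tag = "AI-generated"
print(read_watermark(embed_watermark(img, tag), len(tag)))  # -> "AI-generated"
```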
When it comes to using world models for autonomous vehicles, robotics, or any other application with a direct and tangible impact on the physical world, the risks of errors or hallucinations in these new models are very real: a single prediction error could lead to an accident. It should be noted that, as of 2026, according to Tesla’s own calculations (which are not fully disclosed), its robotaxis perform four times worse than humans in driving tasks. The use of predictive models also carries ethical and legal risks in other fields, for example when influencing medical or military decisions. Language models are already being employed in military operations: according to The Wall Street Journal, cited by The Guardian, Claude, Anthropic’s model, was used in the United States’ attack against Iran “for intelligence purposes, as well as to assist in target selection and battlefield simulations.”
Worlds still uncertain
World models represent a new stage in the rapidly expanding development of artificial intelligence in recent years. They are part of a global race to develop large-scale models, even as researchers such as Yoshua Bengio, co-recipient of the 2018 Turing Award with Yann Le Cun, warn about the risks of AI, its environmental impact, and the importance of involving ethics committees in such projects. The startup AMI, launched in January 2026, had already raised $1 billion by March 2026, despite having announced no launch dates for its products or services, nor a defined business model. As one of the co-founders, Alexandre Lebrun, explained to Le Figaro: “We are working on an ambitious, long-term project. The funding will finance this research.” Regarding the risks associated with these new models, Yann Le Cun told Libération on 10 March 2026 that, “In the end, the decision on what constitutes the best use of AI for society should not rest in the hands of someone like me, or my colleagues. It is up to society and its democratic institutions to decide.”