Explainable AI techniques [2/3]
Written by Nicolas Berkouk, Mehdi Arfaoui and Romain Pialat
08 July 2025

In response to the opacity of deep learning, this article presents the main methods in the field of explainable AI (xAI), organising them according to a distinction between local and global approaches. While detailing these techniques, it nevertheless highlights their lack of robustness and the absence of scientific consensus on what constitutes a valid explanation, thus revealing the profound heterogeneity of the field.

As we saw in the previous article, although the foundations of deep learning (connectionist AI) date back to the 1950s, it wasn't until the mid-2010s that the conditions were met to enable them to dethrone so-called symbolic technical systems, particularly in the analysis of unstructured data such as text, audio or video.
This change fundamentally alters our understanding of the logic of algorithmic computation: the computation and development logic of symbolic systems is, by definition, accessible from the system's code, whereas the structure of operations of a deep learning algorithm emerges from a massive volume of data, with no a priori idea of how to arrive at the result, thus rendering its computation logic opaque.
In the second half of the 2010s, a field of research emerged around the production of explanations for the results of deep learning algorithms to overcome this opacity: Explainable AI (or xAI). The vast majority of xAI methods aim to provide complementary information to a deep learning model, in order to give meaning and context to its results.
This article provides an introduction to the main methods in this field. Without claiming to be exhaustive, as this research community is so prolific (5500 papers published in 2024 - source: Semantic Scholar), we aim here to provide a few structuring landmarks in this literature.
To categorize these methods, we propose in this article to rely on one criterion in particular: the object of the explanation, in other words, what these methods seek to explain. In this respect, two categories of methods can be distinguished:
- Local methods, i.e. those that draw on explanatory elements relating to the characteristics of a particular input.
- Global methods, i.e. those that draw on explanatory elements relating to the general functioning of the model, and whose principles remain unchanged whatever the input data.
In the following sections, we present the main methods according to this distinction.
Local methods
Local methods aim to establish the link between a model prediction and the input it has been given. These explanations are therefore singular for each use, and differ according to the nature of the input, and the nature of the model. Their results therefore only apply to the model's behavior on a particular input.
In this section, we separate local methods into two categories:
- Methods that assume total or partial access to the internal workings of the model ("white box");
- And model-agnostic methods, i.e. those that treat the model as a black box.
Methods with model access ("white-box")
Here we present two main types of "white-box" methods (there are many others):
- On the one hand, there are methods that use the gradient of the model, i.e. the derivative of the model seen as a function from the inputs to the outputs of the system. These include the Saliency Maps method (K. Simonyan et al.), as well as the GradCAM method (R. R. Selvaraju et al.), short for Gradient-weighted Class Activation Mapping.
- On the other hand, so-called "post hoc" methods (E. M. Kenny et al.) work by identifying parts of the training set that activate the model in a similar way to the input concerned.
Unlike the "back-box" methods we'll describe below, these two types of white-box method presuppose partial (gradient) or total (i.e. at every intermediate stage of computation performed by the neural network) access to the inner workings of the model.
To better understand the gradient method
The derivative of a function at a point x allows us to determine the slope ("inclination") of the straight line that best fits the curve of the function at x. The derivative of the function therefore helps us to understand the best linear approximation to the behavior of a function at a point x.
Let's take the case of a neural network trained to predict whether the digit 0 is present in an image. The model is trained on a base of black-and-white images measuring 28 x 28 pixels. The input space (the images) is made up of 784 (= 28 x 28) numerical values, each of which encodes the intensity of each pixel in an image. The network predicts a score between 0 and 1, indicating how likely it is that the digit 0 is contained in an image. Thus, the neural network is a function f, which takes as input vectors consisting of 784 values, to predict a score.
The gradient of f at a fixed image x enables us to find the linear function that best approximates f in the vicinity of x, and thus to understand the local variations of the function at this point. In particular, it is possible to see how the value of f varies when the value of a pixel in the image is changed. Pixels for which this variation is maximal, i.e. where the prediction score can be greatly modified by just a small change in the value of a pixel of x, are precisely those for which the value of the gradient of f is maximal.
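To make this concrete, here is a minimal sketch of that computation, assuming PyTorch; the small network below is a hypothetical placeholder standing in for a trained model f, not an actual trained classifier.

```python
import torch
import torch.nn as nn

model = nn.Sequential(                # stand-in for a trained network f
    nn.Flatten(),
    nn.Linear(28 * 28, 128), nn.ReLU(),
    nn.Linear(128, 1), nn.Sigmoid()
)
model.eval()

x = torch.rand(1, 1, 28, 28, requires_grad=True)  # a fixed input image x
score = model(x)                      # f(x): score for "a 0 is present"
score.backward()                      # compute the gradient of f at x

saliency = x.grad.abs().squeeze()     # |df/dpixel| for each of the 784 pixels
print(saliency.shape)                 # torch.Size([28, 28]): one value per pixel
```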
Gradient methods
In Saliency Maps, as in GradCAM, the gradient of the neural network is calculated on the basis of input data, enabling us to switch from a complex model to a linear approximation, and to see for each image zone or pixel how it affects the model's prediction. Pixels for which a small variation implies a large change in prediction, i.e. those for which the gradient is high, are considered important for prediction and displayed on the original image with heat zones.
In Saliency Maps, this method is applied to every pixel in the image, whereas in GradCAM it is applied to larger areas, so that the areas of interest highlighted are more comprehensible than pixels.
Figure 1 - Explanation produced by the GradCAM algorithm, showing the heat maps for the "Dog" and "Cat" detections. A subtlety noted in the research paper is that the cat is actually labeled a "tabby cat", as opposed to other cat types in the training database, which is why some of its stripes are detected as influential for the prediction (figure taken from R. R. Selvaraju et al.).
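For readers who want to see the mechanics, here is a minimal Grad-CAM-style sketch, assuming PyTorch and torchvision; the network is instantiated untrained and the input is random, both purely as placeholders, and the choice of layer4 as the target layer is an illustrative assumption.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()       # placeholder: substitute a trained model
feats = {}

def fwd_hook(module, inputs, output):
    output.retain_grad()                    # keep the gradient w.r.t. these feature maps
    feats["A"] = output                     # feature maps A_k of the chosen layer

model.layer4.register_forward_hook(fwd_hook)

x = torch.rand(1, 3, 224, 224)              # placeholder input image
scores = model(x)
scores[0, scores.argmax()].backward()       # backpropagate the top class score

A, dA = feats["A"], feats["A"].grad
alpha = dA.mean(dim=(2, 3), keepdim=True)              # one weight per feature map
cam = F.relu((alpha * A).sum(dim=1, keepdim=True))     # weighted sum, then ReLU
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
# "cam" is a coarse heat map of the image regions driving the predicted class.
```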
Examples of post hoc methods
Another way of producing an explanation of the output of a neural network for an input x is to study the internal functioning ("activations") induced by this input, by comparing it with the activations produced by the training data, for which the real value or "ground truth" is known. In this way, these methods aim to answer the following question: "Which training data produce internal activations similar to those of input x?"
For example, if our model predicts the presence of digits on an image, the explanation will lie in the fact that in the training database, images labeled with the same digit have activated the network in a similar way to our input.
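As a rough sketch of this idea (the network, training data and split into a feature extractor are hypothetical placeholders), one can compare the activations induced by the input with those of the training set and return the most similar training examples:

```python
import torch
import torch.nn as nn

# Trunk of a hypothetical trained digit classifier (the classification head is
# not needed for the comparison of activations).
feature_extractor = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU())

train_images = torch.rand(1000, 1, 28, 28)       # placeholder training set
train_labels = torch.randint(0, 10, (1000,))
x = torch.rand(1, 1, 28, 28)                     # the input to explain

with torch.no_grad():
    train_acts = feature_extractor(train_images)  # activations of the training data
    x_act = feature_extractor(x)                  # activations induced by x

# Which training examples activate the network most similarly to x?
sims = torch.nn.functional.cosine_similarity(x_act, train_acts)
nearest = sims.topk(3).indices
print("explanatory examples:", nearest.tolist(),
      "labels:", train_labels[nearest].tolist())
```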
The illustration below shows how this technique can be used to understand some of the algorithm's errors.
Figure 2 - For each input image ("query"), 3 images are proposed to explain the prediction. We can see, for example, that in the images on the right, the poorly written 8 and 9 are closer to a 3 and a 4 (figure taken from E. M. Kenny et al.).
"Black box" explanations
In contrast to explanations based directly on model weights, we now turn our attention to explainability methods that treat networks as black boxes, making it possible, in theory, to explain any model using only knowledge of its inputs and outputs.
Two of these methods are very well known: LIME, for Local Interpretable Model-agnostic Explanations (M. T. Ribeiro et al.), and SHAP, for SHapley Additive exPlanations (S. M. Lundberg et al.).
The LIME method
The aim of the LIME method is to adapt gradient methods to a black-box context, i.e. to seek to find, in the vicinity of an input x, a linear approximation of the model f in order to understand which input variables are most influential in the calculation of f(x).
In this setting, we only have access to the model's inputs and outputs. The idea is then to use inputs "in the vicinity" of the input to be explained: we query the model several times to see what predictions it makes on inputs "close" to our own. Based on the outputs obtained, we fit a local linear model using, for example, linear regression. It is this local regression that serves as the explanation.
Figure 3 - Schematic illustration of how LIME works (from M. T. Ribeiro et al.)
Here, the model seeks to predict whether a point is a cross or a circle, and has learned during training to classify all points in the pink zone as crosses, and all points in the blue zone as circles. If we are interested in the input symbolized by the bold cross, we look at the model's outputs in the vicinity of this input to understand how it distinguishes crosses from circles around this point. We can then draw a line separating these two categories; our input lies on the left-hand side of the line, i.e. on the "cross" side of the linear regression just calculated. The dotted line of this regression represents a local version of the model.
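A minimal LIME-style sketch on tabular data might look like the following; the black_box function, the sampling scale and the proximity kernel are illustrative assumptions, not the LIME library itself.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def black_box(X):                          # stand-in for the opaque model f
    return (X[:, 0] * X[:, 1] > 0.2).astype(float)

x = np.array([0.6, 0.5, -0.1])             # the input to explain
X_near = x + rng.normal(scale=0.3, size=(500, 3))   # samples "in the vicinity" of x
y_near = black_box(X_near)                  # query the black box on each sample
weights = np.exp(-np.linalg.norm(X_near - x, axis=1) ** 2)  # closer samples count more

surrogate = Ridge(alpha=1.0).fit(X_near, y_near, sample_weight=weights)
print("local feature influences:", surrogate.coef_)  # the local linear explanation
```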
The SHAP method
SHAP (SHapley Additive exPlanations) is a local explanation method based on Shapley values, a concept from cooperative game theory, which quantifies the contribution of each feature to a given prediction.
In concrete terms, SHAP evaluates the average effect of a feature on the model's prediction by comparing results obtained with and without this feature, in different contexts. This involves considering numerous possible combinations of features and estimating, in an additive way, the marginal influence of each one. The result is a score assigned to each feature, representing its contribution to the deviation between the model's prediction and a reference value (such as the mean prediction).
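The underlying Shapley computation can be sketched by Monte Carlo sampling of feature orderings; the toy model and the reference value below are assumptions for illustration, not the SHAP library itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def model(x):                               # hypothetical black-box prediction
    return 2.0 * x[0] + x[1] * x[2]

x = np.array([1.0, 2.0, 3.0])               # the input to explain
background = np.zeros(3)                     # reference value for "absent" features
n_features, n_samples = len(x), 2000
phi = np.zeros(n_features)

for _ in range(n_samples):
    order = rng.permutation(n_features)      # a random coalition-building order
    z = background.copy()
    prev = model(z)
    for i in order:
        z[i] = x[i]                           # add feature i to the coalition
        phi[i] += model(z) - prev             # its marginal contribution
        prev = model(z)

phi /= n_samples
# The contributions sum to the gap between f(x) and the reference prediction.
print(phi, phi.sum(), model(x) - model(background))
```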
Global methods
As previously stated, global methods aim to identify the general operating characteristics of the model, and to deduce, for each input, elements of understanding that explain how the model arrived at the calculation of its output.
Overall, these methods attempt to link areas of a neural network's computational space to concepts that make sense to humans, so as to be able to make sense of the computations performed by a model. Here, we present two approaches, from the simplest to the most sophisticated.
Conceptualizing neurons
Neural networks are algorithms that string together elementary computational units: neurons. In the course of a calculation based on input data, the activations of each neuron are calculated through successive layers. The activation of each neuron is a numerical value, generally positive and of greater or lesser magnitude.
The most basic approach to gaining an overall understanding of the neural network mechanism is to try to identify whether a strong activation of certain neurons can be correlated with the presence of a concept that would make "sense" in the input data.
For example, in a convolutional network trained for image classification, Yosinski et al. show that part of the network appears to detect faces in an image. More precisely, as illustrated in Figure 4, when we pass an image through this neural network while viewing its projection at channel 151 of the fifth convolutional layer, we notice that the most strongly activated neurons correspond to the pixels on which faces are located in the input image. If this observation is confirmed over a large number of images, we can then consider that this group of neurons in channel 151 has specialized in the detection of one concept: the face.
It is this reverse-engineering work on the network's neurons that constitutes a global method for explaining how the model arrived at its result.
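A bare-bones sketch of this correlation test is shown below (the convolutional trunk, the probe images, the concept labels and the candidate channel are all hypothetical placeholders): do strong activations of one channel co-occur with a human concept such as "a face is present"?

```python
import numpy as np
import torch
import torch.nn as nn

conv_trunk = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                           nn.Conv2d(8, 16, 3, padding=1), nn.ReLU())

probe_images = torch.rand(200, 3, 64, 64)                 # images with known content
has_face = np.random.default_rng(0).integers(0, 2, 200)   # 1 if a face is present

with torch.no_grad():
    acts = conv_trunk(probe_images)                       # shape (200, 16, 64, 64)
channel = 7                                               # candidate "face" channel
strength = acts[:, channel].mean(dim=(1, 2)).numpy()      # mean activation per image

# A high correlation suggests (but does not prove) that the channel encodes the concept.
print(np.corrcoef(strength, has_face)[0, 1])
```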
While interesting, this approach has many limitations, which were quickly identified by the academic community. To begin with, it is very difficult to rigorously and automatically characterize the concepts that would be represented by one or more neurons. What's more, there's no fundamental reason why a neuron should only be involved in encoding a single concept. Indeed, since neurons are only elementary computational units, it may well be possible for a neuron to be involved in detecting several concepts (polysemy), or for a concept to be detected on the basis of a combination (not necessarily linear) of the activations of certain neurons.
As we can see, the search for neurons whose activation can be correlated with the presence of a concept that makes sense to the user is no easy task. If this quest seems futile, it is because its very object, the individual neuron, needs to be called into question.
Figure 4 - Adapted from Yosinski et al.
Mechanistic Interpretability
But then, if not among the neurons, where should we look for patterns of network activation that can be linked to concepts present in the data? Without attempting to define them precisely, we will call such patterns features. Starting from the principle that neurons can be polysemous, the search for features can only be carried out by attempting to "untangle" the characteristics that may be superimposed within several neurons. One way of moving in this direction is to try to reproduce the way the model works by training a new model, which we will call a "toy model", whose neurons have so-called parsimonious (sparse) activations, i.e. they are forced not to all activate at the same time. We can then hope to make sense of the parsimonious activations of these new neurons more easily. These explanatory techniques are known as "mechanistic interpretability".
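A minimal sketch of this "toy model" idea is a sparse autoencoder trained on the activations of one layer, with an L1 penalty that forces most of its units to stay silent; the dimensions and the placeholder activations below are illustrative assumptions, not the setup used in the work cited next.

```python
import torch
import torch.nn as nn

d_model, d_features, l1_coeff = 64, 512, 1e-3
acts = torch.rand(10_000, d_model)          # placeholder: activations collected
                                            # from one layer of the original model
encoder = nn.Linear(d_model, d_features)
decoder = nn.Linear(d_features, d_model)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for step in range(1000):
    batch = acts[torch.randint(0, len(acts), (256,))]
    features = torch.relu(encoder(batch))   # parsimonious feature activations
    recon = decoder(features)               # try to reproduce the original activations
    loss = ((recon - batch) ** 2).mean() + l1_coeff * features.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

# After training, each decoder column can be inspected as a candidate "feature"
# that one may then try to label with a human concept.
```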
This work has been carried out, for example, by teams at Anthropic on their large language model Claude 3 Sonnet. It should be noted that many of these results, while promising, must be treated with caution. The model is not open-sourced, so the company's findings are not reproducible. What's more, Anthropic does not disclose the proportion of features it was able to extract from the toy model; all we know is that it captures 65% of Claude's behavior.
Figure 5 - Representation of the "untangling" of features from a model, adapted from Bereska et al.
Once we have obtained a toy model with parsimonious activations for each layer of the network, and we have managed to give "meaning" to some of the neurons in this toy model, it becomes possible to ask how these semantic neurons interact with each other across the network, forming circuits, and potentially to trace which features are mobilized in formulating the response to a prompt.
In another article, Anthropic also claims to have used this technique to identify hidden reasoning in a version of Claude specially trained "not to say what it thinks".
Lack of scientific consensus
Matters of robustness
Sensitivity to manipulation
Explainability techniques have come in for some criticism as to their robustness, particularly in the event of attack or manipulation. In a paper by Ann-Kathrin Dombrowski et al., the authors demonstrate that it is possible to manipulate explanations without altering the prediction of the initial algorithm. For example, by inserting information undetectable to the naked eye into an image, in the manner of steganography techniques, one can lure the explanation algorithm into highlighting pixels that are not the ones that actually influence the underlying prediction.
This allows us to make the explanations say what we want, while maintaining a good prediction from the basic algorithm.
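To give a feel for this kind of manipulation, here is a simplified sketch under assumed settings (placeholder network, arbitrary target map, Softplus activations so that the optimisation has usable second derivatives); it is not the authors' exact procedure. It optimises a small perturbation that pushes a gradient-based explanation towards an arbitrary target map while keeping the prediction essentially unchanged.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.Softplus(),
                      nn.Linear(64, 1))                   # placeholder classifier
x = torch.rand(1, 1, 28, 28)                              # the image to "explain"
target = torch.zeros(1, 1, 28, 28)
target[0, 0, :5, :5] = 1.0                                # an arbitrary fake explanation

delta = torch.zeros_like(x, requires_grad=True)           # imperceptible perturbation
opt = torch.optim.Adam([delta], lr=1e-2)
with torch.no_grad():
    out0 = model(x)                                       # prediction to preserve

for _ in range(300):
    out = model(x + delta)
    # gradient-based explanation of the perturbed input (kept differentiable)
    expl, = torch.autograd.grad(out.sum(), delta, create_graph=True)
    loss = ((expl.abs() - target) ** 2).mean() + 10.0 * ((out - out0) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# After optimisation, model(x + delta) stays close to model(x), but the saliency
# map of x + delta points towards the arbitrary target region instead.
```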
Figure 6 - Here, the neural network makes the right prediction, but the explanation is faked (authors)
The curse of dimensionality
Black-box methods such as LIME require the estimation of a linear model on the input data space. In many situations, such as image classification, this space is very high-dimensional (one dimension per pixel, multiplied by three if the image is in color!). This brings us face to face with a problem well known to statisticians: the curse of dimensionality. Indeed, when the dimensionality of a space is very high, it becomes very difficult to correctly estimate models (even linear ones), and a very large number of observations is required to do so, which can become very costly in terms of computing resources and lead to unstable results.
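A small synthetic illustration of this instability (the dimensions, noise level and sample size are arbitrary assumptions): when the number of local samples is far smaller than the dimension, two surrogate fits of the same black box barely agree with each other.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
d, n = 2000, 100                           # dimension >> number of local samples
true_w = rng.normal(size=d)                # the "true" local behaviour of the model

def fit_once(seed):
    r = np.random.default_rng(seed)
    X = r.normal(size=(n, d))              # local samples around the input
    y = X @ true_w + r.normal(scale=0.1, size=n)
    return LinearRegression().fit(X, y).coef_

w1, w2 = fit_once(1), fit_once(2)
# The two estimated explanations agree only weakly with each other and with the truth.
print(np.corrcoef(w1, w2)[0, 1], np.corrcoef(w1, true_w)[0, 1])
```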
A great heterogeneity
As outlined in the previous sections, xAI techniques are numerous and diverse. An important observation, which we are now in a position to make, is that these techniques are based on heterogeneous principles. In particular, the mechanisms on which the explanation rests (what actually does the explaining) are not the same across the techniques presented above.
For example, gradient image explanation methods provide a saliency map indicating the intensity of the correlation between the variation in value of a pixel and the value of the model output. In this way, we can say that the explanation mechanism is correlative, since it is a correlation between variations in input variables and output values that enables it to be produced.
On the other hand, in the case of post hoc examples, the explanation is based on the fact that the input being analyzed activates the network in the same way as the examples in the training dataset. The explanation is therefore based on the principle of analogy.
As we have seen, explainability techniques are developing at a very rapid pace. They exist in very different technical forms and suffer from a lack of robustness, or vulnerability to manipulation. But on an even more fundamental level, we have seen that their heterogeneity is not just technical, but also epistemological: what constitutes an explanation can vary greatly from one method to another.
Since the production of explanations is not just good practice, but a legal requirement for the lawful use of certain AI systems, how can we choose what constitutes an acceptable explanation? To understand these techniques, we need to look at how they are produced, and to describe the workings and intentions of the research community working on these issues. In this respect, taking a more sociological look at the production of explainability methods would shed light on the relationship between AI development, regulation and scientific production... This is precisely what we propose in the next article in this dossier.
Next article [3/3] ⮕
⬅ Previous article [1/3]