A Learning Theoretic Perspective on Local Explainability

Fig 1: A formal relationship between interpretability and complexity. Going from left to right, we consider increasingly complex functions. As the complexity increases, local linear explanations can approximate the function only in smaller and smaller neighborhoods. These neighborhoods, in other words, need to become more and more disjoint as the function becomes more complex. Indeed, we quantify “disjointedness” of the neighborhoods via a term denoted by (rho_S) and relate it to the complexity of the function class, and subsequently, its generalization properties.


There has been a growing interest in interpretable machine learning (IML), towards helping users better understand how their ML models behave. IML has become a particularly relevant concern especially as practitioners aim to apply ML in important domains such as healthcare [Caruana et al., ’15], financial services [Chen et al., ’18], and scientific discovery [Karpatne et al., ’17].

While much of the work in IML has been qualitative and empirical, in our recent ICLR21 paper, we study how concepts in interpretability can be formally related to learning theory. At a high level, the connection between these two fields seems quite natural. Broadly, one can consider there to be a trade-off between notions of a model’s interpretability and its complexity. That is, as a model’s decision boundary gets more complicated (e.g., compare a sparse linear model vs. a neural network) it is harder in a general sense for a human to understand how it makes decisions. At the same time, learning theory commonly analyzes relationships between notions of complexity for a class of functions (e.g., the number of parameters required to represent those functions) and the functions’ generalization properties (i.e., their predictive accuracy on unseen test data). Therefore, it is natural to suspect that, through some appropriate notion of complexity, one can establish connections between interpretability and generalization.

How do we establish this connection? First, IML encompasses a wide range of definitions and problems, spanning both the design of inherently interpretable models as well as post-hoc explanations for black-boxes (e.g. including but not limited to approximation [Riberio et al., ’16], counterfactual [Dhurandhar et al., ’18], and feature importance-based explanations [Lundberg & Lee, ’17]). In our work, we focus on a notion of interpretability that is based on the quality of local approximation explanations. Such explanations are a common and flexible post-hoc approach for IML, used by popular methods such as LIME [Riberio et al., ’16] and MAPLE [Plumb et al., ‘18] which we’ll briefly outline later in the blog. We then answer two questions that relate this notion of local explainability to important statistical properties of the model:

  1. Performance Generalization: How can a model’s predictive accuracy on unseen test data be related to the interpretability of the learned model? 
  2. Explanation Generalization: We look at a novel statistical problem that arises in a growing subclass of local approximation algorithms (such as MAPLE and RL-LIM [Yoon et al., ‘19]). Since these algorithms learn explanations by fitting them on finite data, the explanations may not necessarily fit unseen data well. Hence, we ask, what is the quality of those explanations on unseen data?

In what follows, we’ll first provide a quick introduction to local explanations. Then, we’ll motivate and answer our two main questions by discussing a pair of corresponding generalization guarantees that are in terms of how “accurate” and “complex” the explanations are. Here, the “complexity” of the local explanations corresponds to how large of a local neighborhood the explanations span (the larger the neighborhood, the lower the complexity — see Fig 1 for a visualization). For question (1), this results in a bound that roughly captures the idea that an easier-to-locally-approximate (f) enjoys better performance generalization. For question (2), our bound tells us that, when the explanations can accurately fit all the training data that fall within large neighborhoods, the explanations are likely to fit unseen data better. Finally, we’ll examine our insights in practice by verifying that these guarantees capture non-trivial relationships in a few real-world datasets. 

Local Explainability

Local approximation explanations operate on a basic idea: use a model from a simple family (like a linear model) to locally mimic a model from a more complex family (like a neural network model). Then, one can directly inspect the approximation (e.g. by looking at the weights of the linear model). More formally, for a given black-box model (f : mathcal{X} rightarrow mathcal{Y}), the explanation system produces at any input (x in mathcal{X}) , a ”simple” function (g_x(cdot) : mathcal{X} rightarrow mathcal{Y}) that approximates (f) in a local neighborhood around (x) . Here, we assume (f in mathcal{F}) (the complex model class) and (g_x in mathcal{G}_{text{local}} ) (the simple model class). 

As an example, here’s what LIME (Local Interpretable Model-agnostic Explanations) does. At any point (x), in order to produce the corresponding explanation (g_x), LIME would sample a bunch of perturbed points in the neighborhood around (x) and label them using the complex function (f(x)). It would then learn a simple function that fits the resulting dataset. One can then use this simple function to better understand how (f) behaves in the locality around (x).

Performance Generalization

Our first result is a generalization bound on the squared error loss of the function (f). Now, a typical generalization bound would look something like

$$text{TestLoss}(f) ≤ text{TrainLoss}(f) + sqrt{ frac{text{Complexity}(mathcal{F})}{text{# of training examples}}  }$$

where the bound is in terms of how well (f) fits the training data, and also how “complex” the function class (mathcal{F}) is. In practice though, (mathcal{F}) can often be a very complex class, rendering these bounds too large to be meaningful.

Yet while (mathcal{F}) is complex, what if the function (f) might itself have been picked from a subset of (mathcal{F}) that is in some way much simpler? For example, this is the case in neural networks trained by gradient descent [Zhang et al., ‘17, Neyshabur et al., ‘15]). Capturing this sort of simplicity could lead to more interesting bounds that (a) aren’t as loose and/or (b) shed insight into what meaningful  properties of (f) can influence how well it generalizes. While there are many different notions of simplicity that different learning theory results have studied, here we are interested in quantifying simplicity in terms of how “interpretable” (f) is, and relate that to generalization. 

To state our  result, imagine that we have a training set (S = {(x_i,y_i)}_{i=1}^{m})sampled from the data distribution (D). Then, we show the following bound on the test-time squared error loss:

$$underbrace{mathbb{E}_{D}[(f(x)-y)^2]}_{text{Test Loss}} leq underbrace{hat{mathbb{E}}_{S}[(f(x)-y)^2]}_{text{Train Loss}} + underbrace{mathbb{E}_{x sim D} mathbb{E}_{x’ sim N_{x}}[(g_{x’}(x) – f(x))^2]}_{text{Explanation Quality (MNF)}} + underbrace{rho_S cdot mathcal{R}(mathcal{G}_{local})}_{substack{text{Complexity of} \ text{Explanation System}}}.$$

Let’s examine these terms one by one.

Train Loss: The first term, as is typical of many generalization bounds, is simply the training error of (f) on (S). 

Explanation Quality (MNF): The second term captures a notion of how interpretable (f) is, measuring how accurate the set of local explanations (g) is with a quantity that we call the “mirrored neighborhood fidelity” (MNF). This metric is actually a slight modification of a standard notion of local explanation quality used in IML literature, called neighborhood fidelity (NF) [Riberio et al., ’16; Plumb et al., ‘18]. More concretely, we explain how MNF and NF are calculated below in Fig 2.

Fig 2. How MNF is calculated: We use orange to denote “source points” (where explanations are generated) and teal to denote “target points” (where approximation error is computed). To compute the inner expectation for MNF, we sample a single target point (x sim D). Then, we sample a source point (x’) from a “neighborhood” distribution (N_x) (typically a distribution centered at (x)). We then measure how well (f) is approximated at (x) by the explanation generated at (x’). Averaging over (x) and (x’), we define (text{MNF} = mathbb{E}_{x sim D} mathbb{E}_{x’ sim N_{x}}[(g_{x’}(x) – f(x))^2]). To get NF, we simply need to swap (x) and (x’) in the innermost term: (text{NF} = mathbb{E}_{x sim D} mathbb{E}_{x’ sim N_{x}}[(g_{x}(x’) – f(x’))^2]).

While notationally the differences between MNF and the standard notion of NF are slight, there are some noteworthy differences and potential (dis)advantages of using MNF over NF from an interpretability point of view. At a high level, we argue that MNF offers more robustness (when compared to NF) to any potential irregular behavior of (f) off the manifold of the data distribution. We discuss this in greater detail in the full paper.

Complexity of Explanation System: Finally, the third and perhaps the most interesting term measures how complex the infinite set of explanation functions ({g_x}_{x in mathcal{X}}) is. As it turns out, this system of explanations ({g_x}_{x in mathcal{X}}), which we will call (g), has a complexity that can be nicely decomposed into two factors. One factor, namely (mathcal{R}(mathcal{G}_{text{local}})), corresponds to the (Rademacher) complexity of the simple local class (mathcal{G}_{text{local}}), which is going to be a very small quantity, much smaller than the complexity of (mathcal{F}). Think of this factor as typically scaling linearly with the number of parameters for (mathcal{G}_{text{local}}) and also with the dataset size (m) as (1/sqrt{m}). The second factor is (rho_S), and is what we call the “neighborhood disjointedness” factor. This factor lies between ([1, sqrt{m}]) and is defined by how little overlap there is between the different local neighborhoods specified for each of the training datapoints in (S). When there is absolutely no overlap, (rho_S) can be as large as (sqrt{m}), but when all these neighborhoods are exactly the same, (rho_S) equals (1). 

Implications of the overall bound: Having unpacked all the terms, let us take a step back and ask: assuming that (f) has fit the training data well (i.e., the first term is small), when are the other two terms large or small? We visualize this in Fig 1. Consider the case where MNF can be made small by approximating (f) by (g) on very large neighborhoods (Fig 1 left). In such a case, the neighborhoods would overlap heavily, thus keeping (rho_S) small as well. Intuitively, this suggests good generalization when (f) is “globally simple”. On the other hand, when (f) is too complex, then we need to either shrink the neighborhoods or increase the complexity of (mathcal{G}_{text{local}}) to keep MNF small. Thus, one would either suffer from MNF or (rho_S) exploding, suggesting bad generalization. In fact, when (rho_S) is as large as (sqrt{m}), the bound is “vacuous” as the complexity term no longer decreases with the dataset size (m), suggesting no generalization!

Explanation Generalization

We’ll now turn to a different, novel statistical question which arises when considering a number of recent IML algorithms. Here we are concerned with how well explanations learned from finite data generalize to unseen data. 

To motivate this question more clearly, we need to understand a key difference between canonical and finite-sample-based IML approaches. In canonical approaches (e.g. LIME), at different values of (x’), the explanations (g_{x’}) are learned by fitting on a fresh bunch of points (S_{x’}) from a (user-defined) neighborhood distribution (N_{x’}) (see Fig 3, top). But a growing number of approaches such as MAPLE and RL-LIM learn their explanations by fitting (g_{x’}) on a “realistic” dataset (S) drawn from (D) (rather than from an arbitrary distribution) and then re-weighting the datapoints in (S) depending on a notion of their closeness to (x’) (see Fig 3, bottom).

Now, while the canonical approaches effectively train (g) on an infinite dataset (cup_{x’ in mathcal{X}} S_{x’}), recent approaches train (g) on only that finite dataset (S) (reused for every (x’)). 

Using a realistic (S) has certain advantages (as motivated in this blog post), but on the flip side, since (S) is finite, it can potentially result in a severe chance of overfitting (we visualize this in Fig 3 right). This makes it valuable to seek a guarantee on the approximation error of (g) on test data (“Test MNF”) in terms of its fit on the training data (S) (“Train MNF”). In our paper, we derive such a result below: 

$$underbrace{mathbb{E}_{x sim D} mathbb{E}_{x’ sim N_{x}}[(g_{x’}(x) – f(x))^2]}_{text{Test MNF}} leq underbrace{hat{mathbb{E}}_{x sim S} mathbb{E}_{x’ sim N_{x}}[(g_{x’}(x) – f(x))^2]}_{text{Train MNF}} + rho_S cdot mathcal{R}(mathcal{G}_{local}). $$

As before, what this bound implies is that when the neighborhoods have very little overlap, there is poor generalization. This indeed makes sense: if the neighborhoods are too tiny, any explanation (g_{x’}) would have been trained on a very small subset of (S) that falls within its neighborhood. Thus the fit of (g_{x’}) won’t generalize to other neighboring points.

Fig 3. Difference between canonical local explanations (a) vs. finite-sample-based explanations (b and c): On the top panel (a), we visualize how one would go about generating explanations for different source points in a canonical method like LIME. In the bottom panels (b and c), we visualize the more recent approaches where one uses (and reuses) a single dataset for each explanation. Crucially, to learn an explanation at a particular source point, these procedures correspondingly re-weight this common dataset (visualized by the orange square boxes which are more opaque for points closer to each source point). In panel (b), the common dataset is large enough that it leads to good explanations; but in panel (c), the dataset is too small that the explanations do not generalize well to their neighborhoods.


While our theoretical results offer insights that make sense qualitatively, we also want to make a case empirically that they indeed capture meaningful relationships between the quantities involved. Particularly, we explore this for neural networks trained on various regression tasks from the UCI suite and in the context of the “explanation generalization” bound. That is we learn explanations to fit a finite dataset (S) by minimizing Train MNF, and then evaluate what Test MNF is like. Here, there are two important relationships we empirically establish:

  1. Dependence on (rho_S): Given that our bounds may be vacuous for large (rho_S=sqrt{m}), does this quantity actually scale well in practice (i.e. less than (O(sqrt{m})))? Indeed, we observe that we can find reasonably large choices of the neighborhood size ( sigma ) without causing Train MNF to become too high (somewhere around (sigma = 1) in Fig 4 bottom) and for which we can also achieve a reasonably small (rho_S approx O(m^{0.2})) (Fig 4 top).
  2. Dependence on neighborhood size: Do wider neighborhoods actually lead to improved generalization gaps? From Fig 4 bottom, we do observe that as the neighborhood width increases, TrainMNF and TestMNF overall get closer to each other, indicating that the generalization gap decreases (Fig 4 bottom).

Fig 4. Empirical study of our bounds For various neighborhood widths (sigma), in the top, we plot the approximate exponent of (rho_S)‘s polynomial growth rate i.e., the exponent (c) in (rho_S = O(|S|^c)). Below, we plot train/test MNF. We observe a tradeoff here: increasing (rho_S) results in better values of (rho_S) but hurts the MNF terms.


We have shown how a model’s local explainability can be formally connected to some of its various important statistical properties. One direction for future work is to consider extending these ideas to high-dimensional datasets, a challenging setting where our current bounds become prohibitively large. Another direction would be to more thoroughly explore these bounds in the context of neural networks, for which researchers are in search of novel types of bounds [Zhang et al., ‘17; Nagarajan and Kolter ‘19].

Separately, when it comes to the interpretability community, it would be interesting to explore the advantages/disadvantages of evaluating and learning explanations via MNF rather than NF. As discussed here, MNF appears to have reasonable connections to generalization, and as we show in the paper, it may also promise more robustness to off-manifold behavior.

To learn more about our work, check out our upcoming ICLR paper. Moreover, for a broader discussion about IML and some of the most pressing challenges in the field, here is a link to a recent white paper we wrote.


Jeffrey Li, Vaishnavh Nagarajan, Gregory Plumb, and Ameet Talwalkar, 2021, “A Learning Theoretic Perspective on Local Explainability“, ICLR 2021.

Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad, 2015, “Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission.” ACM SIGKDD, 2015.

Chaofan Chen, Kancheng Lin, Cynthia Rudin, Yaron Shaposhnik, Sijia Wang, and Tong Wang, 2018, “An interpretable model with globally consistent explanations for credit risk.” NeurIPS 2018 Workshop on Challenges and Opportunities for AI in Financial Services: the Impact of Fairness, Explainability, Accuracy, and Privacy, 2018.

Anuj Karpatne, Gowtham Atluri, James H. Faghmous, Michael Steinbach, Arindam Banerjee, Auroop Ganguly, Shashi Shekhar, Nagiza Samatova, and Vipin Kumar, 2017, “Theory-guided data science: A new paradigm for scientific discovery from data.” IEEE Transactions on Knowledge and Data Engineering, 2017.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin, 2016, “Why should I trust you?: Explaining the predictions of any classifier.” ACM SIGKDD, 2016.

Scott M. Lundberg, and Su-In Lee, 2017, “A unified approach to interpreting model predictions.” NeurIPS, 2017.

Amit Dhurandhar, Pin-Yu Chen, Ronny Luss, Chun-Chen Tu, Paishun Ting, Karthikeyan Shanmugam, and Payel Das, 2018, “Explanations based on the Missing: Towards Contrastive Explanations with Pertinent Negatives” NeurIPS, 2018.

Jinsung Yoon, Sercan O. Arik, and Tomas Pfister, 2019, “RL-LIM: Reinforcement learning-based locally interpretable modeling”, arXiv 2019 1909.12367.

Gregory Plumb, Denali Molitor and Ameet S. Talwalkar, 2018, “Model Agnostic Supervised Local Explanations“, NeurIPS 2018.

Vaishnavh Nagarajan and J. Zico Kolter, 2019, “Uniform convergence may be unable to explain generalization in deep learning”, NeurIPS 2019

Behnam Neyshabur, Ryota Tomioka, Nathan Srebro, 2015, “In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning”, ICLR 2015 Workshop.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals, 2017, “Understanding deep learning requires rethinking generalization”, ICLR’ 17.

Valerie Chen, Jeffrey Li, Joon Sik Kim, Gregory Plumb, and Ameet Talwalkar, 2021, “Towards Connecting Use Cases and Methods in Interpretable Machine Learning“, arXiv 2021 2103.06254.

Read More