Explaining Neural Network Decisions Is Hard

Jan Macdonald, Stephan Wäldchen, Sascha Hauch, Gitta Kutyniok

July, 2020

Abstract

We connect the widespread idea of interpreting classifier decisions to probabilistic prime implicants. A set of input features is deemed relevant for a classification decision if the classifier score remains nearly constant when randomising the remaining features. This introduces a rate-distortion trade-off between the set size and the deviation of the score. We explain how relevance maps can be interpreted as a greedy strategy to calculate the rate-distortion function. For neural networks we show that approximating this function even in a single point up to any non-trivial approximation factor is NP-hard. Thus, no algorithm will provably find small relevant sets of input features even if they exist. Finally, as a numerical comparison we express a Boolean function, for which the prime implicant sets are known, as a neural network and investigate which relevance mapping methods are able to highlight them.
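The rate-distortion trade-off described in the abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration and is not taken from the paper: the toy Boolean classifier `toy_classifier`, the uniform distribution over {0, 1} used to randomise features, the squared deviation used as the distortion measure, and the helper names `distortion` and `rate_distortion_curve`. The brute-force search over subsets is exponential in the number of features, which is consistent with the hardness result stated above but is only feasible for tiny inputs.

```python
import itertools


def toy_classifier(x):
    # Hypothetical Boolean classifier (x0 AND x1) OR x2, returning a score in {0.0, 1.0}.
    return float((x[0] and x[1]) or x[2])


def distortion(f, x, keep):
    """Exact expected squared deviation of f when every feature outside
    `keep` is resampled uniformly from {0, 1} (assumed distortion measure)."""
    free = [i for i in range(len(x)) if i not in keep]
    target = f(x)
    total = 0.0
    # Enumerate all assignments of the randomised features.
    for values in itertools.product([0, 1], repeat=len(free)):
        y = list(x)
        for i, v in zip(free, values):
            y[i] = v
        total += (f(y) - target) ** 2
    return total / 2 ** len(free)


def rate_distortion_curve(f, x):
    """For each set size k, the smallest distortion over all feature
    subsets of size k, found by brute force."""
    n = len(x)
    return [
        min(distortion(f, x, set(s)) for s in itertools.combinations(range(n), k))
        for k in range(n + 1)
    ]


if __name__ == "__main__":
    x = [1, 1, 0]
    print(distortion(toy_classifier, x, {0, 1}))  # features 0 and 1 fix the decision
    print(rate_distortion_curve(toy_classifier, x))
```

For the input `[1, 1, 0]`, keeping features 0 and 1 gives zero distortion, since `x0 AND x1` already forces the output. Under the assumptions above, that pair plays the role of a prime implicant for this decision.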

Type

Workshop contribution

Publication

XXAI Workshop: Extending Explainable AI Beyond Deep Models and Classifiers, 37th International Conference on Machine Learning (ICML 2020)

Deep Neural Networks, Explainable Neural Networks