Explaining Neural Network Decisions Is Hard


We connect the widespread idea of interpreting classifier decisions to probabilistic prime implicants. A set of input features is deemed relevant for a classification decision if the classifier score remains nearly constant when randomising the remaining features. This introduces a rate-distortion trade-off between the set size and the deviation of the score. We explain how relevance maps can be interpreted as a greedy strategy to calculate the rate-distortion function. For neural networks we show that approximating this function even in a single point up to any non-trivial approximation factor is NP-hard. Thus, no algorithm will provably find small relevant sets of input features even if they exist. Finally, as a numerical comparison we express a Boolean function, for which the prime implicant sets are known, as a neural network and investigate which relevance mapping methods are able to highlight them.

XXAI Workshop: Extending Explainable AI Beyond Deep Models and Classifiers, 37th International Conference on Machine Learning (ICML 2020)
Jan Macdonald
Jan Macdonald

My research is at the interface of applied and computational mathematics and scientific machine learning. I am interested in inverse problems, signal- and image recovery, and robust and interpretable deep learning.