Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)
As machine learning has gained adoption, model interpretability has become a growing concern: how does a model make its decisions, and which features of the input most heavily influence the output?
Some models, such as linear classifiers and decision trees, are relatively easy to explain. Others, like neural networks, are often referred to as black boxes because it is difficult to understand how they arrive at their output.
One approach to explaining a simple ML model is to use its coefficient weights as a measure of feature importance. Another approach is the use of saliency maps, which assign importance to individual pixels based on first-order derivatives of the output with respect to the input.
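As a rough illustration of the saliency-map idea (not part of the paper), here is a minimal sketch using a stand-in PyTorch classifier and a dummy input; the model and image are hypothetical placeholders:

```python
import torch
import torchvision.models as models

# Stand-in classifier and dummy input (illustrative only).
model = models.resnet18(weights=None)
model.eval()
image = torch.rand(1, 3, 224, 224, requires_grad=True)

logits = model(image)
top_class = logits.argmax(dim=1).item()

# First-order derivative of the top logit with respect to the input pixels.
logits[0, top_class].backward()

# Pixel importance: maximum absolute gradient across colour channels.
saliency = image.grad.abs().max(dim=1).values  # shape: (1, 224, 224)
```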
Many ML models, however, operate on low-level features (such as raw pixel values) that do not correspond to the high-level concepts humans reason about.
The authors of this paper formulate a mapping that relates the vector space an ML model operates in to the vector space in which humans operate. When this mapping is linear, they call it linear interpretability. The mapping does not need to be perfect to be useful.
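For reference, the paper writes this as a function from the model's vector space to the human concept space (restated here from the paper's setup):

```latex
% E_m: vector space spanned by the model's internal state
% E_h: vector space spanned by human-interpretable concepts
% An "interpretation" of the model is a function g from one to the other:
g : E_m \to E_h
% When g is linear, the authors call this linear interpretability.
```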
The authors introduce the notion of a Concept Activation Vector (CAV) as a way of translating from the ML vector space to the human vector space. A CAV is derived by training a linear classifier to distinguish a concept's examples from random counter-examples.
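As a rough sketch of how such a CAV could be derived (the arrays and classifier choice below are illustrative stand-ins, not the paper's exact setup), one can train a linear classifier on the layer activations of concept images versus random images and take the vector normal to its decision boundary:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Dummy stand-ins for layer-l activations of "striped" images and random images.
rng = np.random.default_rng(0)
concept_acts = rng.normal(size=(100, 512))
random_acts = rng.normal(size=(100, 512))

X = np.vstack([concept_acts, random_acts])
y = np.array([1] * len(concept_acts) + [0] * len(random_acts))

# Linear classifier separating concept activations from random ones.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# The CAV is the unit vector normal to the decision boundary,
# pointing towards the concept class.
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])
```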
TCAV then quantifies the sensitivity of a model's predictions to the high-level concept captured by the CAV. In their example, given a model trained to recognize zebras and a set of examples that defines "striped", TCAV quantifies the influence of "striped" on the prediction of "zebra" as a single number.
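A minimal sketch of that single number, assuming you already have the gradients of the zebra logit with respect to the chosen layer's activations and a unit-norm CAV (both dummy stand-ins here): the score is the fraction of zebra images whose prediction moves up when the activations move in the "striped" direction.

```python
import numpy as np

rng = np.random.default_rng(1)
grads = rng.normal(size=(50, 512))  # d(zebra logit) / d(layer activations), one row per image
cav = rng.normal(size=512)
cav /= np.linalg.norm(cav)

# Directional derivative of the prediction along the concept direction.
directional_derivatives = grads @ cav

# TCAV score: fraction of zebra images with a positive directional derivative.
tcav_score = float(np.mean(directional_derivatives > 0))
print(tcav_score)
```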
TCAV is a step towards helping humans understand how ML models arrive at the decisions that they make.
The rest of the paper goes into the technical details of the implementation and validation. You can find the paper on arXiv at arXiv:1711.11279.