Recent advances in deep learning have made it possible to predict a substantial amount of the explainable information in the spatial fixation distribution on natural images. For example, our model DeepGaze II uses deep features from the VGG network, trained on object recognition, as an image representation and combines them in a simple pixelwise nonlinear way to predict a fixation density. However, while these models are very successful at predicting fixations, they are largely black boxes and therefore not very good at explaining what drives fixations. Here we address this problem by selecting features that are maximally predictive of fixations in a stepwise fashion (Baddeley & Tatler 2006). Starting from a version of DeepGaze II without any VGG features (a pure center-bias model), we first search for the VGG feature that maximally improves model performance when added to this model. Subsequently, we …