Predicting where humans choose to fixate can help in understanding a variety of human behaviours. Recent years have seen substantial progress in predicting spatial fixation distributions for the viewing of static images. Our own model "DeepGaze II" (Kümmerer et al., ICCV 2017) extracts features from input images using the pretrained VGG deep neural network and uses a simple pixelwise readout network to predict fixation distributions from these features. DeepGaze II is state-of-the-art for predicting free-viewing fixation densities according to the established MIT Saliency Benchmark. However, DeepGaze II predicts only spatial fixation distributions rather than scanpaths. Therefore, the model ignores crucial structure in the fixation selection process. Here we extend DeepGaze II to predict fixation densities conditioned on the previous scanpath. We add additional feature maps encoding the previous scanpath …
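To make the described architecture concrete, a minimal sketch follows. It assumes PyTorch/torchvision; the specific VGG layer, readout widths, and the form of the scanpath-encoding feature maps are illustrative assumptions, not the published DeepGaze II configuration (which also includes, e.g., a centre bias and blurring omitted here).

```python
# Minimal sketch (not the authors' code): a DeepGaze II-style pixelwise readout
# on top of frozen pretrained VGG features, extended with extra feature maps
# encoding the previous scanpath. Layer choice, channel widths, and scanpath
# encoding are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg19

class ScanpathConditionedReadout(nn.Module):
    def __init__(self, n_scanpath_channels=2):
        super().__init__()
        # Frozen pretrained VGG-19 features (illustrative: truncate after relu5_1).
        self.backbone = vgg19(weights="IMAGENET1K_V1").features[:30].eval()
        for p in self.backbone.parameters():
            p.requires_grad = False
        # Simple pixelwise readout: 1x1 convolutions over the concatenation of
        # VGG features and scanpath-encoding maps.
        self.readout = nn.Sequential(
            nn.Conv2d(512 + n_scanpath_channels, 16, kernel_size=1), nn.Softplus(),
            nn.Conv2d(16, 32, kernel_size=1), nn.Softplus(),
            nn.Conv2d(32, 1, kernel_size=1),
        )

    def forward(self, image, scanpath_maps):
        # image: (B, 3, H, W); scanpath_maps: (B, C, H, W), e.g. Gaussian blobs
        # around previous fixations or a distance map to the last fixation.
        feats = self.backbone(image)
        maps = F.interpolate(scanpath_maps, size=feats.shape[-2:],
                             mode="bilinear", align_corners=False)
        logits = self.readout(torch.cat([feats, maps], dim=1))
        # Upsample to image resolution and normalise to a log fixation density
        # conditioned on the previous scanpath.
        logits = F.interpolate(logits, size=image.shape[-2:],
                               mode="bilinear", align_corners=False)
        log_density = F.log_softmax(logits.flatten(1), dim=1).view_as(logits)
        return log_density
```

Under these assumptions, the network could be trained by maximising the log-density at the next observed fixation location given the image and the maps encoding the preceding fixations.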