DeepGaze vs SceneWalk: what can DNNs and biological scan path models teach each other?

Abstract

Eye movements on natural scenes are driven by image content as well as by saccade dynamics and sequential dependencies. Recent research has produced a variety of models that aim to predict time-ordered fixation sequences, including statistical, mechanistic, and deep neural network (DNN) models, each with its own advantages and shortcomings. Here we show how a synthesis of different modeling frameworks may offer fresh insights into the underlying processes. First, the explanatory power of biologically inspired models can help develop an understanding of the mechanisms learned by DNNs. Second, DNN performance can be used to estimate how predictable the data are and thereby help uncover new mechanisms. DeepGaze3 (DG3) is currently the best-performing DNN model for scan path prediction (Kümmerer & Bethge, 2020); SceneWalk (SW) is the best-performing biologically inspired dynamical model (Schwetlick et al., 2021). Both models can be fitted using maximum likelihood estimation, and both compute a likelihood for each fixation. We can therefore analyze where their predictions diverge at the level of individual fixations. DG3 generally outperforms SW, indicating that the DNN accounts for variance by learning mechanisms that are not yet included in the mechanistic SW model. Preliminary results show that SW tends to underestimate the probability of long, explorative saccades. In SW, this behavior could be achieved by replacing the Gaussian attention span with a heavier-tailed function or by letting the attention span fluctuate over time. Furthermore, DG3 appears to compress the predicted probability over previously unexplored areas, increasing the likelihood of saccades to the region center. Once the region is fixated, DG3 broadens the local probability, consistent with a dual exploration-exploitation strategy. Adding corresponding mechanisms to SW may improve model performance and help develop more advanced dynamical models. Identifying synergies between different modeling approaches, specifically high-performing DNNs and more transparent dynamical models, is a valuable way to improve our understanding of fixation selection during scene viewing.
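
The per-fixation comparison described in the abstract can be illustrated with a minimal Python sketch. It assumes that each model has already produced one log-likelihood value per fixation for the same scan paths. The function names and the numbers in the usage note are hypothetical and are not taken from DG3, SW, or the reported results.

```python
import numpy as np


def per_fixation_divergence(ll_a, ll_b):
    """Return per-fixation log-likelihood differences (A minus B) and their mean.

    ll_a, ll_b: sequences of per-fixation log-likelihoods from two models,
    evaluated on the same fixations in the same order (an assumption of this sketch).
    """
    ll_a = np.asarray(ll_a, dtype=float)
    ll_b = np.asarray(ll_b, dtype=float)
    if ll_a.shape != ll_b.shape:
        raise ValueError("both models must be evaluated on the same fixations")
    diff = ll_a - ll_b
    return diff, float(diff.mean())


def largest_divergences(diff, k=3):
    """Indices of the k fixations where the two models disagree most strongly."""
    order = np.argsort(-np.abs(np.asarray(diff, dtype=float)), kind="stable")
    return order[:k]
```

For example, with made-up log-likelihoods `[-1.0, -2.0, -3.0]` for one model and `[-1.5, -2.0, -6.0]` for the other, the third fixation shows the largest difference. Fixations flagged this way are candidates for closer inspection, for instance to check whether they follow long saccades.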

Matthias Kümmerer
Postdoc

I’m interested in understanding how we use eye movements to gather information about our environment. This includes building saliency models and models of eye movement prediction, such as my line of DeepGaze models. I also work on the question of how to evaluate and benchmark model quality, and I’m the main organizer of the MIT/Tuebingen Saliency Benchmark.

Matthias Bethge
Professor for Computational Neuroscience and Machine Learning & Director of the Tübingen AI Center

Matthias Bethge is Professor for Computational Neuroscience and Machine Learning at the University of Tübingen and director of the Tübingen AI Center, a joint center between Tübingen University and MPI for Intelligent Systems that is part of the German AI strategy.