Deep feature matching vs. spatio-temporal energy filtering for robust moving object segmentation

Abstract

Recent methods for optical flow estimation achieve remarkable precision and are successfully applied in downstream tasks such as segmenting moving objects. These methods are based on matching deep neural network features across successive video frames. In humans, by contrast, the dominant motion estimation mechanism is believed to rely on spatio-temporal energy filtering. Here, we compare both motion estimation approaches for segregating a moving object from a moving background. We render synthetic videos based on scanned 3D objects and backgrounds to obtain ground-truth motion for realistic scenes. We then transform the videos by replacing the textures with random dots that follow the motion of the original video. This way, no individual frame contains any information about the object; it is defined by the motion signal alone. Humans have been shown to recognize objects in such random dot stimuli (Robert et al., 2023). We compare segmentation methods based on the recent RAFT optical flow estimator (Teed and Deng, 2020) and the spatio-temporal energy model of Simoncelli and Heeger (1998). Our results show that, when combined with an established segmentation architecture, the spatio-temporal energy approach works almost as well as RAFT on the original videos. Furthermore, we quantify how much segmentation information can be decoded from each model when using the optimal non-negative superposition of feature maps for each video. This analysis confirms that both optical flow representations can be used for motion segmentation, with RAFT performing slightly better on the original videos. For the random dot stimuli, however, hardly any information about the object can be decoded from RAFT, while the brain-inspired spatio-temporal energy filtering approach is only mildly affected. Based on these results, we explore the use of spatio-temporal filtering for building a more robust model of moving object segmentation.
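The abstract describes three technical ingredients: the random dot transformation of the rendered videos, the spatio-temporal energy model used as a biologically inspired alternative to RAFT, and the non-negative superposition readout used to quantify decodable segmentation information. The sketches below illustrate one plausible implementation of each; all function names, parameters, and design choices are illustrative assumptions, not taken from the paper.

First, a hypothetical sketch of the random dot stimulus construction: dots are placed at random image positions and then advected by the ground-truth optical flow, so that no single frame carries information about the object.

```python
import numpy as np

def random_dot_video(flow, shape, n_dots=5000, seed=None):
    """flow: (T-1, H, W, 2) ground-truth flow (dx, dy) between frames.
    Returns a (T, H, W) binary video of dots that follow the flow.
    (Illustrative sketch; not the paper's actual stimulus code.)"""
    rng = np.random.default_rng(seed)
    H, W = shape
    T = flow.shape[0] + 1
    pos = rng.uniform(0, [W, H], size=(n_dots, 2))  # continuous (x, y)
    video = np.zeros((T, H, W), dtype=np.float32)
    for t in range(T):
        xi = np.clip(pos[:, 0].astype(int), 0, W - 1)
        yi = np.clip(pos[:, 1].astype(int), 0, H - 1)
        video[t, yi, xi] = 1.0
        if t < T - 1:
            pos += flow[t, yi, xi]  # move dots with the local flow
            # Respawn dots that left the frame at random positions.
            out = ((pos[:, 0] < 0) | (pos[:, 0] >= W) |
                   (pos[:, 1] < 0) | (pos[:, 1] >= H))
            pos[out] = rng.uniform(0, [W, H], size=(int(out.sum()), 2))
    return video
```

Second, a minimal spatio-temporal energy front end in the spirit of Adelson and Bergen (1985): quadrature pairs of velocity-tuned 3D Gabor filters whose squared responses are summed. The full Simoncelli and Heeger (1998) model additionally includes divisive normalization and an MT stage, which are omitted here.

```python
import numpy as np
from scipy.ndimage import convolve

def gabor_pair_3d(size, orientation, speed, sigma=2.0, freq=0.25):
    """Quadrature pair of spatio-temporal Gabors tuned to a spatial
    orientation (radians) and speed (pixels/frame). Parameter values
    are illustrative, not the paper's."""
    r = np.arange(size) - size // 2
    t, y, x = np.meshgrid(r, r, r, indexing="ij")  # axis 0 = time
    # The carrier drifts along the orientation at the given speed.
    u = x * np.cos(orientation) + y * np.sin(orientation) - speed * t
    env = np.exp(-(x**2 + y**2 + t**2) / (2 * sigma**2))
    return (env * np.cos(2 * np.pi * freq * u),
            env * np.sin(2 * np.pi * freq * u))

def motion_energy(video, n_orientations=8, speeds=(0.0, 1.0, 2.0), size=9):
    """video: (T, H, W) grayscale. Returns (channels, T, H, W) energies."""
    maps = []
    for k in range(n_orientations):
        theta = k * np.pi / n_orientations
        for v in speeds:
            even, odd = gabor_pair_3d(size, theta, v)
            # Summing squared quadrature responses yields a
            # phase-invariant motion energy signal.
            maps.append(convolve(video, even) ** 2 +
                        convolve(video, odd) ** 2)
    return np.stack(maps)
```

Third, one plausible reading of the "optimal non-negative superposition of feature maps": for each video, fit non-negative weights over the model's channel maps by non-negative least squares against the ground-truth mask, and measure how well the resulting combination reconstructs it. The paper's exact objective may differ.

```python
import numpy as np
from scipy.optimize import nnls

def decode_mask(feature_maps, mask):
    """feature_maps: (C, H, W); mask: (H, W) binary ground truth.
    Returns non-negative channel weights and the predicted mask."""
    C = feature_maps.shape[0]
    A = feature_maps.reshape(C, -1).T    # (H*W, C) design matrix
    b = mask.reshape(-1).astype(float)   # flattened target
    w, _ = nnls(A, b)                    # w >= 0 minimizing ||Aw - b||
    pred = (A @ w).reshape(mask.shape)   # optimal superposition
    return w, pred
```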

Publication
VSS 2024
Matthias Tangemann
PhD candidate
Matthias Kümmerer
Postdoc

I’m interested in understanding how we use eye movements to gather information about our environment. This includes building saliency models and models for predicting eye movements, such as my line of DeepGaze models. I also work on how to evaluate model quality and on benchmarking, and I’m the main organizer of the MIT/Tuebingen Saliency Benchmark.

Matthias Bethge
Professor for Computational Neuroscience and Machine Learning & Director of the Tübingen AI Center

Matthias Bethge is Professor for Computational Neuroscience and Machine Learning at the University of Tübingen and director of the Tübingen AI Center, a joint center of the University of Tübingen and the MPI for Intelligent Systems that is part of the German AI strategy.