Predictability of eye movements analyzed in a machine learning framework
Presented at the Rovereto Attention Workshop on Attention and Motor Control 2008, Rovereto
Eleonora Vig, Michael Dorr, and Erhardt Barth
We investigate the extent to which a simple model of low-level saliency, based on local spectral energy computed on different visual representations, can predict saccade targets in natural dynamic scenes. Our objective is to learn transformations that alter the saliency distribution of the scene in real time, thus implementing gaze guidance. We analyze the eye movements of 54 human subjects watching 18 high-resolution movie clips of outdoor scenes, each of about 20 s duration. The roughly 40,000 detected saccades are used to label local movie patches as attended or non-attended. The representations we use are based on the intrinsic dimension of the visual input and encode the spatio-temporal signal change. The geometric invariants H, S, and K are computed on multiple scales of an isotropic spatio-temporal multiresolution pyramid, and spectral energy is extracted in the neighborhood of each location on these scales. Thus, for each location we obtain a feature vector whose dimensionality equals the number of spatio-temporal pyramid levels. To quantify how well eye movements can be predicted, we apply machine learning algorithms. Despite the simplicity of the representation, we achieve an area under the ROC curve of 0.71; with an anisotropic pyramid, this value rises to 0.80. We also show that predictability correlates with the intrinsic dimension: the higher the intrinsic dimension, the higher the predictive power. In contrast to previous approaches, our method keeps the number of feature-space dimensions low despite using information from multiple scales.
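The pipeline described above (multi-scale pyramid, one energy feature per level per location, ROC evaluation of attended vs. non-attended patches) can be sketched in a few lines of NumPy. This is a minimal illustration under simplifying assumptions, not the authors' implementation: a 2x2x2 box-average stands in for the actual low-pass filtering of the isotropic spatio-temporal pyramid, local patch variance stands in for the spectral energy of the geometric invariants H, S, and K, and the area under the ROC curve is computed directly from classifier-free scores via the Mann-Whitney U statistic. All function names are hypothetical.

```python
import numpy as np

def build_pyramid(video, levels=4):
    """Isotropic spatio-temporal pyramid: each level halves T, Y, and X.
    (A 2x2x2 box average stands in for proper low-pass filtering.)"""
    pyr = [video.astype(float)]
    for _ in range(levels - 1):
        v = pyr[-1]
        t, y, x = (d - d % 2 for d in v.shape)   # trim to even sizes
        v = v[:t, :y, :x]
        v = 0.125 * (v[0::2, 0::2, 0::2] + v[1::2, 0::2, 0::2] +
                     v[0::2, 1::2, 0::2] + v[0::2, 0::2, 1::2] +
                     v[1::2, 1::2, 0::2] + v[1::2, 0::2, 1::2] +
                     v[0::2, 1::2, 1::2] + v[1::2, 1::2, 1::2])
        pyr.append(v)
    return pyr

def feature_vector(pyr, t, y, x, r=2):
    """One energy value per pyramid level, so the feature-space
    dimensionality equals the number of levels. 'Spectral energy' is
    approximated here by the variance in a small spatio-temporal
    neighborhood around the (rescaled) location."""
    feats = []
    for lvl, v in enumerate(pyr):
        s = 2 ** lvl                         # coordinate scaling at this level
        tt, yy, xx = t // s, y // s, x // s
        patch = v[max(0, tt - r):tt + r + 1,
                  max(0, yy - r):yy + r + 1,
                  max(0, xx - r):xx + r + 1]
        feats.append(patch.var())
    return np.array(feats)

def roc_auc(scores_attended, scores_nonattended):
    """Area under the ROC curve via the Mann-Whitney U statistic:
    the probability that an attended patch outscores a non-attended one."""
    pos = np.asarray(scores_attended, dtype=float)[:, None]
    neg = np.asarray(scores_nonattended, dtype=float)[None, :]
    return ((pos > neg).sum() + 0.5 * (pos == neg).sum()) / (pos.size * neg.size)
```

In use, the per-level features of attended and non-attended patches would be fed to a classifier (e.g. an SVM), whose decision values then play the role of the scores passed to `roc_auc`.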