Analyzing bottom-up saliency in natural movies
Presented at the European Conference on Visual Perception 2010
Eleonora Vig, Michael Dorr, and Erhardt Barth
We investigate the contribution of local spatio-temporal variations of
image intensity to saliency. To measure different types of variations,
we use invariants of the structure tensor. Treating a video as a
function of two spatial axes (x,y) and one temporal axis t, the
n-dimensional structure tensor (nD-ST) can be evaluated on different
combinations of axes (2D- and 3D-ST) and also for the (degenerate) case
of only one axis (1D-ST).
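
As a rough illustration of evaluating a structure tensor on a chosen subset of axes, the following Python sketch computes one invariant, the determinant (the product of all eigenvalues), at every location. It is not the authors' implementation: the (t, y, x) array layout, the box-filter smoothing, the window size, and the function names are all assumptions made for this example.

```python
import numpy as np

def box_smooth(a, size=3):
    """Average over a size**ndim neighbourhood (separable box filter, edge-padded).

    Assumption: the abstract does not say which smoothing kernel is used;
    a box filter is chosen here only to keep the sketch short.
    """
    kernel = np.ones(size) / size
    pad = size // 2
    for axis in range(a.ndim):
        widths = [(pad, pad) if i == axis else (0, 0) for i in range(a.ndim)]
        padded = np.pad(a, widths, mode="edge")
        a = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), axis, padded
        )
    return a

def structure_tensor_invariant(video, axes, size=3):
    """Determinant of the nD-ST built from the given axes.

    video: array indexed as (t, y, x)  (assumed layout)
    axes:  tuple of axis indices, e.g. (0, 1, 2) for the 3D-ST,
           (0, 1) for the 2D-ST on (t, y), (2,) for the 1D-ST on x.
    """
    grads = [np.gradient(video, axis=ax) for ax in axes]
    n = len(axes)
    J = np.empty(video.shape + (n, n))
    for i in range(n):
        for j in range(i, n):
            # Locally averaged products of partial derivatives (symmetric tensor).
            J[..., i, j] = J[..., j, i] = box_smooth(grads[i] * grads[j], size)
    # The determinant equals the product of all n eigenvalues.
    return np.linalg.det(J)
```

For example, `structure_tensor_invariant(video, axes=(0, 1, 2))` is non-zero only where intensity varies along all three directions, whereas `axes=(0,)` responds to any temporal change.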
Eye movements recorded on 18 natural videos are used to label locations as fixated or non-fixated. For each location, we compute the invariants (products of eigenvalues) of the nD-ST and use these to predict eye movements on unseen videos with an SVM classifier. The 3D-ST gives the best prediction (average ROC score of 0.656), which indicates that the most predictive regions of a movie are those where intensity varies along all spatial and temporal directions. Among 2-dimensional variations, the 2D-ST evaluated on the axes (y,t) gave the best score (0.638), followed by (x,y) (0.626) and (x,t) (0.625). The 1D-ST yielded 0.606 along the temporal, 0.604 along the horizontal, and 0.602 along the vertical axis. We conclude that bottom-up saliency is determined by spatio-temporal variations of image intensity rather than by spatial or temporal variations alone.
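
To make the ROC scores above concrete, here is a minimal sketch of how an area under the ROC curve can be computed from per-location classifier outputs and fixated/non-fixated labels. The abstract does not describe the exact evaluation protocol, so the function name `roc_auc` and the pairwise formulation are assumptions for illustration only.

```python
import numpy as np

def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: classifier outputs, higher meaning "more likely fixated"
    labels: truthy for fixated locations, falsy for non-fixated ones
    Returns the fraction of (fixated, non-fixated) pairs ranked correctly,
    counting ties as half. 0.5 is chance level, 1.0 is perfect separation.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos = scores[labels]
    neg = scores[~labels]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size)
```

Under this reading, a score of 0.656 means that a randomly chosen fixated location receives a higher classifier output than a randomly chosen non-fixated one in roughly 66% of cases.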