Eye movements on natural videos: Predictive power of different low-level features
Presented at the European Conference on Visual Perception 2007
Eleonora Vig, Michael Dorr, Thomas Martinetz, and Erhardt Barth
We used eye movements recorded from 54 subjects, who viewed two
high-resolution videos of outdoor scenes, to define an empirical
saliency (ES) measure as the density of saccade landing points. We used
ES to label a dataset of local movie blocks (17x17x8 pixels, extracted
from the original videos) as "salient" or "non-salient" (1000 samples
per class). We then computed four different representations of these
blocks: Laplacian, colour opponency, motion, and spatio-temporal
curvature K. Next, we used two different classifiers (maximum
likelihood on the feature-vector length, and k-nearest-neighbour on the
full feature vectors) to classify the movie blocks into the two classes
for all representations.
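
To illustrate the pipeline, the following minimal sketch (in Python,
assuming NumPy, SciPy, and scikit-learn) shows how the empirical
saliency could be estimated, how one representation of a movie block
could be computed, and how the two classifiers could be set up. The
kernel bandwidth, the Gaussian model of the feature-vector length, and
k = 5 are illustrative assumptions, not parameters from the study.

```python
import numpy as np
from scipy.ndimage import laplace
from scipy.stats import norm as gaussian
from sklearn.neighbors import KernelDensity, KNeighborsClassifier

def empirical_saliency(landing_points, bandwidth=20.0):
    """Empirical saliency as the density of saccade landing points.

    landing_points: (N, 3) array of (x, y, t) saccade targets pooled
    over subjects. Returns a fitted density model; score_samples()
    gives the log-density at query locations. The bandwidth is an
    illustrative assumption.
    """
    return KernelDensity(bandwidth=bandwidth).fit(landing_points)

def laplacian_feature(block):
    """One possible representation: the Laplacian of a 17x17x8 movie
    block, flattened into a feature vector."""
    return laplace(block.astype(float)).ravel()

# Classifier 1: maximum likelihood on the feature-vector length.
# Here the length distribution of each class is modelled as a Gaussian
# (an assumption; the abstract does not name the density model).
def fit_length_ml(X, y):
    return {c: (np.linalg.norm(X[y == c], axis=1).mean(),
                np.linalg.norm(X[y == c], axis=1).std())
            for c in np.unique(y)}

def predict_length_ml(params, X):
    lengths = np.linalg.norm(X, axis=1)
    loglik = np.stack([gaussian.logpdf(lengths, mu, sd)
                       for mu, sd in params.values()])
    classes = np.array(list(params.keys()))
    return classes[loglik.argmax(axis=0)]

# Classifier 2: k-nearest-neighbour on the full feature vectors.
knn = KNeighborsClassifier(n_neighbors=5)
```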
The error
rates reflect the predictive power of the different representations.
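A hypothetical evaluation loop for such a comparison might look as
follows; the stand-in random features (which yield chance-level error,
about 50%) and the cross-validation setup are our assumptions, standing
in for the real representations and test protocol.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_dim = 17 * 17 * 8  # size of a flattened movie block

# Stand-in features for two representations, 1000 samples per class as
# above; real features would come from the labelled movie blocks.
features = {"laplacian": rng.normal(size=(2000, n_dim)),
            "K": rng.normal(size=(2000, n_dim))}
labels = np.repeat([0, 1], 1000)

for name, X in features.items():
    err = 1 - cross_val_score(KNeighborsClassifier(n_neighbors=5),
                              X, labels, cv=5).mean()
    print(f"{name}: error rate {err:.1%}")
```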
Under all conditions, K produced the lowest error rates. For the movie
with many moving objects, motion was second best, but it was the worst
predictor on the other movie. We conclude that simple low-level features
can predict saccade targets with an error rate of only 15%; a more
complex classifier reduces this to 9%, but the improvement does not
generalize to a different movie.