Learning a master saliency map from eye tracking data in videos

Related paper: A. Coutrot, N. Guyader, Learning a time-dependent master saliency map from eye tracking data in videos, [pdf].


To predict the most salient regions of complex natural scenes, saliency models commonly compute several feature maps (contrast, orientation, motion...) and linearly combine them into a master saliency map. Since feature maps have different spatial distributions and amplitude dynamic ranges, determining their contributions to overall saliency remains an open problem. Most state-of-the-art models combine feature maps in a time-independent fashion. However, visual exploration is a highly dynamic process shaped by many time-dependent factors. Here we use the Least Absolute Shrinkage and Selection Operator (Lasso) algorithm to learn a time-dependent optimal linear combination of the visual features directly from eye-tracking data. Feature weights systematically vary as a function of time, and are particularly subject to time-dependent oculomotor tendencies such as center bias. Feature weights also heavily depend on the semantic visual category of the videos being processed. Our fusion method takes these variations into account, and outperforms other state-of-the-art fusion schemes that use constant feature weights over time.
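The core idea — learning feature weights with the Lasso — can be sketched in a few lines. The snippet below is a minimal pure-Python illustration, not the released Matlab code: each flattened feature map is treated as one column of X, the eye position density map is the target y, and coordinate descent solves the l1-penalized least-squares problem. The synthetic data, the alpha value, and all function names are invented for the example.

```python
import random

def soft_threshold(rho, alpha):
    """Soft-thresholding operator at the heart of lasso coordinate descent."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def lasso_weights(X, y, alpha, n_sweeps=100):
    """Coordinate descent for min_w (1/2n)||y - Xw||^2 + alpha*||w||_1."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho, z = 0.0, 0.0
            for i in range(n):
                pred = sum(w[k] * X[i][k] for k in range(p))
                # partial residual: what feature j alone must explain
                rho += X[i][j] * (y[i] - pred + w[j] * X[i][j])
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (z / n)
    return w

# Toy stand-in for flattened feature maps: 3 centered features, 200 pixels.
random.seed(0)
n = 200
X = [[random.random() - 0.5 for _ in range(3)] for _ in range(n)]
# Ground truth: maps 1 and 2 contribute; map 3 is irrelevant.
y = [0.7 * row[0] + 0.3 * row[1] for row in X]
w = lasso_weights(X, y, alpha=0.005)
```

The l1 penalty both shrinks the recovered weights slightly toward zero and drives the weight of the irrelevant third feature map to (near) zero, which is the sparsity property the paper exploits.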


download Matlab code

 + example

Left: for each video frame, the Lasso is used to learn the visual feature weights that best fit the eye position density map. At the bottom right, the Lasso output shows different sets of weights for different values of the regularization parameter. The set of weights leading to the model with the smallest Bayesian Information Criterion (BIC) is chosen.
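The BIC-based selection step can be sketched as follows. This is a hypothetical stand-in rather than the released code: the hand-written candidate weight sets play the role of the lasso path at decreasing regularization values, and the candidate minimizing the Gaussian BIC, n*ln(RSS/n) + k*ln(n) with k the number of non-zero weights, is kept.

```python
import math
import random

def bic(y, y_hat, n_params):
    """Gaussian BIC: n*ln(RSS/n) + k*ln(n); lower is better."""
    n = len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return n * math.log(rss / n) + n_params * math.log(n)

def predict(w, X):
    return [sum(wj * xj for wj, xj in zip(w, row)) for row in X]

# Synthetic target built from 3 centered features plus small noise.
random.seed(1)
n = 200
X = [[random.random() - 0.5 for _ in range(3)] for _ in range(n)]
y = [0.7 * r[0] + 0.3 * r[1] + random.gauss(0, 0.05) for r in X]

# Hypothetical weight sets standing in for a lasso regularization path,
# from heavily regularized (all zero) to lightly regularized (all active).
path = [
    [0.0, 0.0, 0.0],
    [0.7, 0.0, 0.0],
    [0.7, 0.3, 0.0],
    [0.7, 0.3, 0.05],
]

scores = [bic(y, predict(w, X), sum(1 for wj in w if wj != 0.0))
          for w in path]
best = min(range(len(path)), key=lambda i: scores[i])
```

For the lasso, the number of non-zero weights is a standard estimate of the model's degrees of freedom, so BIC penalizes the denser candidates and favors the sparse set that still fits the density map well.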

Right: temporal evolution of the best set of feature weights (static saliency, dynamic saliency, center bias, uniform map, face 1 and face 2).

Comparison of the time-dependent fusion using the Lasso weights (bottom left) with the time-independent fusion proposed in Marat et al., 2013 (bottom right). The features used are static saliency, dynamic saliency, faces, center bias, and uniform map.