Publications

For an up to date list of publications including conferences and seminars, see my cv or my Google Scholar profile

PhD thesis Influence of sound while exploring dynamic natural scenes, defended on October 2014. [pdf, French]

Won the best PhD award from Grenoble-Alpes University. See this 6 minutes video in French, with English subtitles.

Abstract

We study the influence of different audiovisual features on the visual exploration of dynamic natural scenes. We show that, whilst the way a person explores a scene primarily relies on its visual content, sound sometimes significantly influences eye movements. Sound assures a better coherence between the eye posi- tions of different observers, attracting their attention and thus their gaze toward the same regions. The effect of sound is particularly strong in conversation scenes, where the related speech signal boosts the number of fixations on speakers’ faces, and thus increases the consistency between scanpaths. We propose an audiovisual saliency model able to automatically locate speakers’ faces so as to enhance their saliency. These results are based on the eye movements of 148 participants recorded on more than 75,400 frames (125 videos) in 5 different experimental conditions.

manuscript on TEL-HAL (in French).

Pre-prints

Zisch, Fiona E* and Newton, Coco* and Coutrot, Antoine* and Murcia, Maria and Motala, Anisa and Greaves, Jacob and de Cothi, William and Steed, Anthony and Tyler, Nick and Gage, Stepen A and Spiers, Hugo. Comparable human spatial memory distortions in physical, desktop virtual and immersive virtual environments, 2022 [bioRxiv].

*Authors contributed equally to this work

Coutrot, Antoine and Guyader, Nathalie, "Learning a time-dependent master saliency map from eye-tracking data in videos", 2016 [arXiv].

Journal Papers

[20] Coutrot, Antoine and Manley Ed and Goodroe, Sarah and Gahnstrom, Chris and Filomena, Gabriele and Yesiltepe, Demet and Dalton, Ruth C and Wiener, Jan and Hölscher, Christoph and Hornberger, Michael and Spiers, Hugo. Entropy of city street networks linked to future spatial navigation ability, Nature, Vol. 604, pp 104-110, 2022. [pdf ]

Abstract

The cultural and geographical properties of the environment have been shown to deeply influence cognition and mental health (1–6). Living near green spaces has been found to be strongly beneficial (7–11), and urban residence has been associated with a higher risk of some psychiatric disorders (12–14)— although some studies suggest that dense socioeconomic networks found in larger cities provide a buffer against depression (15). However, how the environment in which one grew up affects later cognitive abilities remains poorly understood. Here we used a cognitive task embedded in a video game (16) to measure non-verbal spatial navigation ability in 397,162 people from 38 countries across the world. Overall, we found that people who grew up outside cities were better at navigation. More specifically, people were better at navigating in environments that were topologically similar to where they grew up. Growing up in cities with a low street network entropy (for example, Chicago) led to better results at video game levels with a regular layout, whereas growing up outside cities or in cities with a higher street network entropy (for example, Prague) led to better results at more entropic video game levels. This provides evidence of the effect of the environment on human cognition on a global scale, and highlights the importance of urban design in human cognition and brain function.

[19] Quesque, François, Coutrot, Antoine and [57 co-authors] and Bertoux, Maxime, "Does culture shape our understanding of others' thoughts and emotions? An investigation across 12 countries", Neuropsychology 2022. [pdf ]

Abstract
Measures of social cognition have now become central in neuropsychology, being essential for early and differential diagnoses, follow-up and rehabilitation in a wide range of conditions. With the scientific world becoming increasingly interconnected, international neuropsychological and medical collaborations are burgeoning to tackle the global challenges that are mental health conditions. These initiatives commonly merge data across a diversity of populations and countries, while ignoring their specificity. Objective: In this context, we aimed to estimate the influence of participants’ nationality on social cognition evaluation. This issue is of particular importance as most cognitive tasks are developed in highly specific contexts, not representative of that encountered by the world’s population. Method: Through a large international study across 18 sites, neuropsychologists assessed core aspects of social cognition in 587 participants from 12 countries using traditional and widely used tasks. Results: Age, gender, and education were found to impact measures of mentalizing and emotion recognition. After controlling for these factors, differences between countries accounted for more than 20% of the variance on both measures. Importantly, it was possible to isolate participants’ nationality from potential translation issues, which classically constitute a major limitation. Conclusions: Overall, these findings highlight the need for important methodological shifts to better represent social cognition in both fundamental research and clinical practice, especially within emerging international networks and consortia.

[18] Spiers, Hugo and Coutrot, Antoine and Hornberger, Michael. Explaining World-Wide Variation in Navigation Ability from Millions of People: Citizen Science Project Sea Hero Quest, Topics in Cognitive Science, Early View, 2021. [pdf ]

Abstract
Navigation ability varies widely across humans. Prior studies have reported that being younger and male has an advantage for navigation ability. However, these studies have generally involved small numbers of participants from a handful of western countries. Here we review findings from our project Sea Hero Quest which used a video game for mobile and tablet devices to test 3.9 million people on their navigation ability, sampling across every nation state and from 18-99 years of age. Results revealed that the task has good ecological validity and across all countries sufficiently sampled (N=57), age is linked to a near linear decline in navigation ability from the early 20s. All countries showed a male advantage, but this varied considerably and could be partly predicted by gender inequality. We found that those reported growing up in a city were on average worse at navigating than those who grew up outside cities, and that navigation performance helped identify those at greater genetic-risk of Alzheimer's disease. We discuss the advantages and challenges of using a mobile app to study cognition and the future avenues for understanding individual differences in navigation ability arising from this research.

[17] de Maissin, Astrid and Vallée, Rémi and Flamant, Mathurin and Fondain-Bossiere, Marie and Le Berre, Catherine and Coutrot, Antoine and Normand, Nicolas and Mouchère, Harold and Coudol, Sandrine and Trang, Caroline and Boureille, Arnaud. Multi-expert annotation of Crohn's disease images of the small bowel for automatic detection using a convolutional recurrent attention neural network, Endoscopy International Open, Vol. 9, pp E1136 - E1144, 2021. [pdf ]

Abstract
Background and study aims- Computer-aided diagnostic tools using deep neural networks are efficient for detection of lesions in endoscopy but require a huge number of images. The impact of the quality of annotation has not been tested yet. Here we describe a multi-expert annotated da- taset of images extracted from capsules from Crohn’s disease patients and the impact of the quality of annotations on the accuracy of a recurrent attention neural network.

Methods- Images of capsule were annotated by a reader first and then reviewed by three experts in inflammatory bowel disease. Concordance analysis between experts was evaluated by Fleiss’ kappa and all the discordant images were, again, read by all the endoscopists to obtain a consensus annotation. A recurrent attention neural network developed for the study was tested before and after the consensus annotation. Available neural networks (ResNet and VGGNet) were also tested under the same conditions.

Results- The final dataset included 3498 images with 2124 non-pathological (60.7%), 1360 pathological (38.9%), and 14 (0.4%) inconclusive. Agreement of the experts was good for distinguishing pathological and non-pathological images with a kappa of 0.79 (P<0.0001). The accuracy of our classifier and the available neural networks increased after the consensus annotation with a precision of 93.7%, sensitivity of 93 %, and specificity of 95 %.

Conclusions- Theaccuracyoftheneuralnetworkincreased with improved annotations, suggesting that the number of images needed for the development of these systems could be diminished using a well-designed dataset.

[16] Yesiltepe, Demet and Ozbil Torun, Ayse and Coutrot, Antoine and Hornberger, Michael and Spiers, Hugo and Conroy Dalton, Ruth, "Computer models of saliency alone fail to predict subjective visual attention to landmarks during observed navigation", Spatial Cognition & Computation, 2020. [pdf]

Abstract

This study aimed to understand whether or not computer models of saliency could explain landmark saliency. An online survey was conducted and participants were asked to watch videos from a spatial navigation video game (Sea Hero Quest). Participants were asked to pay attention to the environments within which the boat was moving and to rate the perceived saliency of each landmark. In addition, state-of-the-art computer saliency models were used to objectively quantify landmark saliency. No significant relationship was found between objective and subjective saliency measures. This indicates that during passive observation of an environment while being navigated, current automated models of saliency fail to predict subjective reports of visual attention to landmarks.

[15] Porffy, Lilla and Bell, Victoria and Coutrot, Antoine and Wigton, Rebekah and D'Oliveira, Teresa and Mareschal, Isabelle and Shergill, Sukhwinder, "Oxytocin effects on eye movements in schizophrenia", Schizophrenia Research, Vol. 216, pp 279-287, 2020. [pdf ]

Abstract

Background: Individuals with schizophrenia have difficulty in extracting salient information from faces. Eye-tracking studies have reported that these individuals demonstrate reduced exploratory viewing behaviour (i.e.reduced number offixations and shorter scan paths) compared to healthy controls. Oxytocin has previouslybeen demonstrated to exert pro-social effects and modulate eye gaze during face exploration. In this study, wetested whether oxytocin has an effect on visual attention in patients with schizophrenia.

Methods: Nineteen male participants with schizophrenia received intranasal oxytocin 40UI or placebo in adouble-blind, placebo-controlled, crossover fashion during two visits separated by seven days. They engagedin a free-viewing eye-tracking task, exploring images of Caucasian men displaying angry, happy, and neutralemotional expressions; and control images of animate and inanimate stimuli. Eye-tracking parameters included:total number offixations, mean duration offixations, dispersion, and saccade amplitudes.

Results: We found a main effect of treatment, whereby oxytocin increased the total number offixations, disper-sion, and saccade amplitudes, while decreasing the duration offixations compared to placebo. This effect, how-ever, was non-specific to facial stimuli. When restricting the analysis to facial images only, we found the sameeffect. In addition, oxytocin modulatedfixation rates in the eye and nasion regions.

Discussion: This is thefirst study to explore the effects of oxytocin on eye gaze in schizophrenia. Oxytocin had en-hanced exploratory viewing behaviour in response to both facial and inanimate control stimuli. We suggest thatthe acute administration of intranasal oxytocin may have the potential to enhance visual attention inschizophrenia

[14] Coughlan, Gillian and Coutrot, Antoine and Khondoker, Mizanur and Minihane, Anne Marie and Spiers, Hugo and Hornberger, Michael. Toward personalized cognitive diagnostics of at-genetic-risk Alzheimer’s disease, PNAS, Vol. 116, No 19, pp 9285-9292, 2019 [pdf ]

Abstract

Spatial navigation is emerging as a critical factor in identifying preclinical Alzheimer’s disease (AD). However, the impact of interindi- vidual navigation ability and demographic risk factors (e.g., APOE, age, and sex) on spatial navigation make it difficult to identify persons “at high risk” of AD in the preclinical stages. In the current study, we use spatial navigation big data (n = 27,108) from the Sea Hero Quest (SHQ) game to overcome these challenges by investigating whether big data can be used to benchmark a highly phenotyped healthy aging laboratory cohort into high- vs. low-risk persons based on their genetic (APOE) and demographic (sex, age, and educational attain- ment) risk factors. Our results replicate previous findings in APOE e4 carriers, indicative of grid cell coding errors in the entorhinal cortex, the initial brain region affected by AD pathophysiology. We also show that although baseline navigation ability differs between men and women, sex does not interact with the APOE genotype to influence the manifestation of AD-related spatial disturbance. Most importantly, we demonstrate that such high-risk preclinical cases can be reliably distinguished from low-risk participants using big-data spatial naviga- tion benchmarks. By contrast, participants were undistinguishable on neuropsychological episodic memory tests. Taken together, we pre- sent evidence to suggest that, in the future, SHQ normative bench- mark data can be used to more accurately classify spatial impairments in at-high-risk of AD healthy participants at a more individual level, therefore providing the steppingstone for individualized diagnostics and outcome measures of cognitive symptoms in preclinical AD.

[13] Coutrot, Antoine and Schmidt, Sophie and Coutrot, Lena and Pittman, Jessica and Hong, Lynne and Wiener, Jan and Hölscher, Christoph and Dalton Ruth C and Hornberger, Michael and Spiers, Hugo. Virtual navigation tested on a mobile app is predictive of real-world navigation performance, PLoS ONE, Vol. 14, No. 3, e0213272, 2019 [pdf].

Abstract

Virtual reality environments presented on tablets and smartphones have potential to aid the early diagnosis of conditions such as Alzheimer’s dementia by quantifying impairments in navigation performance. However, it is unclear whether performance on mobile devices can predict navigation errors in the real world. We compared the performance of 49 partici- pants (25 females, 18-35 years old) at wayfinding and path integration tasks designed in our mobile app ‘Sea Hero Quest’ with their performance at similar tasks in a real-world environment. We first performed this experiment in the streets of London (UK) and replicated it in Paris (France). In both cities, we found a significant correlation between virtual and real- world wayfinding performance and a male advantage in both environments, although smaller in the real world (Cohen’s d in the game = 0.89, in the real world = 0.59). Results in London and Paris were highly similar, and controlling for familiarity with video games did not change the results. The strength of the correlation between real world and virtual environment increased with the difficulty of the virtual wayfinding task, indicating that Sea Hero Quest does not merely capture video gaming skills. The fact that the Sea Hero Quest way- finding task has real-world ecological validity constitutes a step toward controllable, sensitive, safe, low-cost, and easy to administer digital cognitive assessment of navigation ability.

[12] Coutrot, Antoine and Silva, Ricardo and Manley Ed and de Cothi, Will and Sami, Saber and Bohbot, Véronique and Wiener, Jan and Hölscher, Christoph and Dalton, Ruth C and Hornberger, Michael and Spiers, Hugo. "Global determinants of navigation ability", Current Biology, Vol. 28, No 17, pp 2861-2866, 2018 [pdf].

Abstract

Human spatial ability is modulated by a number of factors including age and gender. While a few studies showed that culture influences cognitive strategies, the interaction between these factors has never been globally assessed as this requires testing millions of people of all ages across many different countries in the world. Since countries vary in their geographical and cultural properties, we predicted that these variations give rise to an organized spatial distribution of cognition at a planetary-wide scale. To test this hypothesis we developed a mobile-app-based cognitive task, measuring non-verbal spatial navigation ability in more than 2.5 million people, sampling populations in every nation state. We focused on spatial navigation due to its universal requirement across cultures. Using a clustering approach, we find that navigation ability is clustered into five distinct, yet geographically related, groups of countries. Specifically, the economic wealth of a nation was predictive of the average navigation ability of its inhabitants, and gender inequality was predictive of the size of performance difference between males and females. Thus, cognitive abilities, at least for spatial navigation, are clustered according to economic wealth and gender inequalities globally, which has significant implications for cross-cultural studies and multi-centre clinical trials using cognitive testing.

[11] Harrison, Charlotte and Binetti, Nicola and Coutrot, Antoine and Johnston, Alan and Mareschal, Isabelle, "Personality traits do not predict how we look at faces", Perception, Vol. 47, No 9, pp 976-984, 2018 [link].

Abstract

While personality has typically been considered to influence gaze behaviour, literature relating to the topic is mixed. Previously, we (Binetti et al. 2016) found no evidence of self-reported personality traits on preferred gaze duration between a participant and a person looking at them via a video. In the current study, 77 out of the original 498 participants answered an in-depth follow-up survey containing a more comprehensive assessment of personality traits (Big Five Inventory) than was initially used, to check whether earlier findings were caused by the personality measure being too coarse. In addition to preferred mutual gaze duration, we also examined two other factors linked to personality traits: number of blinks and total fixation duration in the eye region of observed faces. Using a multiple regression analysis we found that overall, personality traits do not predict how we look at faces, with the exception of openness being only weakly correlated with preferred amount of eye contact. We suggest that effects previously reported in the literature may stem from contextual differences and/or modulation of arousal.

[10] Rider, Andrew and Coutrot, Antoine and Pellicano, Elizabeth and Dakin, Steven and Mareschal, Isabelle, "Semantic content outweighs low-level saliency in determining children’s and adult's fixation of movies", Journal of Experimental Child Psychology, Vol. 166, pp 293-309, 2018 [pdf].

Abstract

To make sense of the visual world, we need to move our eyes to focus regions of interest on the high-resolution fovea. Eye movements, therefore, give us a way to infer mechanisms of visual processing and attention allocation. Here, we examined age-related differences in visual processing by recording eye movements from 37 children (aged 6–14 years) and 10 adults while viewing three 5- min dynamic video clips taken from child-friendly movies. The data were analyzed in two complementary ways: (a) gaze based and (b) content based. First, similarity of scanpaths within and across age groups was examined using three different measures of variance (dispersion, clusters, and distance from center). Second, content-based models of fixation were compared to determine which of these provided the best account of our dynamic data. We found that the variance in eye movements decreased as a function of age, suggesting common attentional orienting. Comparison of the different models revealed that a model that relies on faces generally performed better than the other models tested, even for the youngest age group (<10 years). However, the best predictor of a given participant’s eye movements was the average of all other participants’ eye movements both within the same age group and in different age groups. These findings have implications for understanding how children attend to visual information and highlight similarities in viewing strategies across development.

[9] Le Meur, Olivier and Coutrot, Antoine and Le Roch, Adrien and Helo, Andrea and Rama, Pia and Liu, Zhi, "Visual attention saccadic models learn to emulate the evolution of gaze patterns from childhood to adulthood", IEEE Transactions on Image Processing, Vol. 26, No 10, pp 4777-4789, 2017. [pdf].

Abstract

How people look at visual information reveals fundamental information about themselves, their interests and their state of mind. While previous visual attention models output static 2-dimensional saliency maps, saccadic models aim to predict not only where observers look at but also how they move their eyes to explore the scene. Here we demonstrate that saccadic models are a flexible framework that can be tailored to emulate observer's viewing tendencies. More specifically, we use the eye data from 101 observers split in 5 age groups (adults, 8-10 y.o., 6-8 y.o., 4-6 y.o. and 2 y.o.) to train our saccadic model for different stages of the development of the human visual system. We show that the joint distribution of saccade amplitude and orientation is a visual signature specific to each age group, and can be used to generate age-dependent scanpaths. Our age-dependent saccadic model not only outputs human-like, age-specific visual scanpath, but also significantly outperforms other state-of-the-art saliency models. In this paper, we demonstrate that the computational modelling of visual attention, through the use of saccadic model, can be efficiently adapted to emulate the gaze behavior of a specific group of observers.

[8] Coutrot, Antoine and Hsiao, Janet and Chan, Antoni, "Scanpath modeling and classification with Hidden Markov Models", Behavior Research Methods, pp 1-18, 2017. [pdf]

Abstract

How people look at visual information reveals fundamental information about them; their interests and their states of mind. Previous studies showed that scanpath, i.e. the sequence of eye movements made by an observer exploring a visual stimulus, can be used to infer observer-related (e.g. task at hand) and stimuli-related (e.g. image semantic category) information. However, eye movements are complex signals and many of these studies rely on limited gaze descriptors and bespoke datasets. Here, we provide a turnkey method for scanpath modeling and classification. This method relies on variational Hidden Markov Models (HMMs) and Discriminant Analysis (DA). HMMs encapsulate the dynamic and individualistic dimensions of gaze behavior, allowing DA to capture systematic patterns diagnostic of a given class of observers and/or stimuli. We test our approach on two very different datasets. Firstly, we use fixations recorded while viewing 800 static natural scene images, and infer an observer-related characteristic: the task at hand. We achieve an average of 55.9% correct classification rate (chance = 33\%). We show that correct classification rates positively correlate with the number of salient regions present in the stimuli. Secondly, we use eye positions recorded while viewing 15 conversational videos, and infer a stimulus-related characteristic: the presence or absence of original soundtrack. We achieve an average 81.2% correct classification rate (chance = 50%). HMMs allow to integrate bottom-up, top-down and oculomotor influences into a single model of gaze behavior. This synergistic approach between behaviour and machine learning will open new avenues for simple quantification of gazing behaviour. We release SMAC with HMM, a Matlab toolbox freely available to the community under an open-source license agreement.

[7] Coutrot, Antoine and Binetti, Nicola and Harrison, Charlotte and Mareschal, Isabelle and Johnston, Alan, "Face exploration dynamics differentiate men and women", Journal of Vision, Vol. 16, No 14, pp 1-19, 2016. [pdf]

Abstract

The human face is central to our everyday social interactions. Recent studies have shown that while gazing at faces, each one of us has a particular eye-scanning pattern, highly stable across time. Although variables such as culture or personality have been shown to modulate gaze behaviour, we still don't know what shapes these idiosyncrasies. Moreover most previous observations rely on static analyses of small-sized eye-position datasets averaged across time. Here, we probe the temporal dynamics of gaze to explore what information can be extracted about the observers and what is being observed. Controlling for any stimuli effect, we demonstrate that amongst many individual characteristics, the gender of both the participant (gazer) and the person being observed (actor) are the factors that most influence gaze patterns during face exploration. We record and exploit the largest set of eye tracking data (405 participants, 58 nationalities) from participants watching videos of another person. Using novel data-mining techniques, we show that female gazers follow a much more exploratory scanning strategy than males. Moreover, female gazers watching female actresses look more at the eye on the left side. These results have strong implications in every field using gaze-based models, from computer-vision to clinical psychology.

[6] Binetti, Nicola and Harrison, Charlotte* and Coutrot, Antoine* and Mareschal, Isabelle and Johnston, Alan, "Pupil dilation as an index of preferred mutual gaze duration", Royal Society Open Science, Vol. 3, No 160086, pp 1-11, 2016.

(* Authors contributed equally to this work). [pdf] This work has been highlighted in Science.

Most animals look at each other to signal threat or interest. In humans, this social interaction is usually punctuated with brief periods of mutual eye contact. Deviations from this pattern of gazing behaviour generally make us feel uncomfortable and are a defining characteristic of clinical conditions such as autism or schizophrenia, yet it is unclear what constitutes normal eye contact. Here, we measured, across a wide range of ages, cultures and personality types, the period of direct gaze that feels comfortable and examined whether autonomic factors linked to arousal were indicative of people’s preferred amount of eye contact. Surprisingly, we find that preferred period of gaze duration is not dependent on fundamental characteristics such as gender, personality traits or attractiveness. However, we do find that subtle pupillary changes, indicative of physiological arousal, correlate with the amount of eye contact people find comfortable. Specifically, people preferring longer durations of eye contact display faster increases in pupil size when viewing another person than those preferring shorter durations. These results reveal that a person’s preferred duration of eye contact is signalled by physiological indices (pupil dilation) beyond volitional control that may play a modulatory role in gaze behaviour.

[5] Coutrot, Antoine and Guyader, Nathalie, "Learning a time-dependent master saliency map from eye-tracking data in videos", 2016 [arXiv].

Abstract

To predict the most salient regions of complex natural scenes, saliency models commonly compute several feature maps (contrast, orientation, motion...) and linearly combine them into a master saliency map. Since feature maps have different spatial distribution and amplitude dynamic ranges, determining their contributions to overall saliency remains an open problem. Most state-of-the-art models do not take time into account and give feature maps constant weights across the stimulus duration. However, visual exploration is a highly dynamic process shaped by many time-dependent factors. For instance, some systematic viewing patterns such as the center bias are known to dramatically vary across the time course of the exploration. In this paper, we use maximum likelihood and shrinkage methods to dynamically and jointly learn feature map and systematic viewing pattern weights directly from eye-tracking data recorded on videos. We show that these weights systematically vary as a function of time, and heavily depend upon the semantic visual category of the videos being processed. Our fusion method allows taking these variations into account, and outperforms other state-of-the-art fusion schemes using constant weights over time. The code, videos and eye-tracking data we used for this study are available online: http://antoinecoutrot.magix.net/public/research.html

[4] Le Meur, Olivier and Coutrot, Antoine, "Introducing context-dependent and spatially-variant viewing biases in saccadic models", Vision Research, Vol. 121, pp 72-84, 2016. [pdf] (Authors contributed equally to this work)

Previous research showed the existence of systematic tendencies in viewing behavior during scene exploration. For instance, saccades are known to follow a positively skewed, long-tailed distribution, and to be more frequently initiated in the horizontal or vertical directions. In this study, we hypothesize that these viewing biases are not universal, but are modulated by the semantic visual category of the stimulus. We show that the joint distribution of saccade amplitudes and orientations significantly varies from one visual category to another. These joint distributions are in addition spatially variant within the scene frame. We demonstrate that a saliency model based on this better understanding of viewing behavioral biases and blind to any visual information outperforms well-established saliency models. We also propose a saccadic model that takes into account classical low-level features and spatially-variant and context-dependent viewing biases. This model outperforms state-of-the-art saliency models, and provides scanpaths in close agreement with human behavior. The better description of viewing biases will not only improve current models of visual attention but could also influence many other applications such as the design of human-computer interfaces, patient diagnosis or image/video processing applications.

[3] Coutrot, Antoine and Guyader, Nathalie, "How Saliency, Faces and Sound influence gaze in Dynamic Social Scenes",

Journal of Vision, Vol. 14, No 8, pp 1-17, 2014. [pdf]

Abstract

Conversation scenes are a typical example in which classical models of visual attention dramatically fail to predict eye positions. Indeed, these models rarely consider faces as particular gaze attractors and never take into account the important auditory information that always accompanies dynamic social scenes. We recorded the eye movements of participants viewing dynamic conversations taking place in various contexts. Conversations were seen either with their original soundtracks or with unrelated soundtracks (unrelated speech and abrupt or continuous natural sounds). First, we analyze how auditory conditions influence the eye movement parameters of participants. Then, we model the probability distribution of eye positions across each video frame with a statistical method (Expectation- Maximization), allowing the relative contribution of different visual features such as static low-level visual saliency (based on luminance contrast), dynamic low- level visual saliency (based on motion amplitude), faces, and center bias to be quantified. Through experimental and modeling results, we show that regardless of the auditory condition, participants look more at faces, and especially at talking faces. Hearing the original soundtrack makes participants follow the speech turn-taking more closely. However, we do not find any difference between the different types of unrelated soundtracks. These eye- tracking results are confirmed by our model that shows that faces, and particularly talking faces, are the features that best explain the gazes recorded, especially in the original soundtrack condition. Low-level saliency is not a relevant feature to explain eye positions made on social scenes, even dynamic ones. Finally, we propose groundwork for an audiovisual saliency model.

[2] Coutrot, Antoine and Guyader, Nathalie and Ionescu, Gelu and Caplier, Alice, "Video viewing: do auditory salient events capture visual attention?", Annals of Telecommunications, Vol. 69, No. 1 pp 89-97, 2014. [pdf]

Abstract

We assess whether salient auditory events contained in soundtracks modify eye movements when exploring videos. In a previous study, we found that on average, non-spatial sound contained in video soundtracks impacts on eye movements. This result indicates that sound could play a leading part in visual attention models to predict eye movements. In this research, we go further and test whether the effect of sound on eye movements is stronger just after salient auditory events. To automatically spot salient auditory events, we used two auditory saliency models: the Discrete Energy Separation Algorithm and the Energy model. Both models provide a saliency time curve, based on the fu- sion of several elementary audio features. The most salient auditory events were extracted by thresholding these curves. We examined some eye movements parameters just after these events rather than on all the video frames. We showed that the effect of sound on eye movements (variability between eye positions, saccade amplitude and fixation duration) was not stronger after salient auditory events than on average over entire videos. Thus, we suggest that sound could impact on visual exploration not only after salient events but in a more global way.

[1] Coutrot, Antoine and Guyader, Nathalie and Ionescu, Gelu and Caplier, Alice, "Influence of soundtrack on eye movements during video exploration", Journal of Eye Movement Research, Vol. 5, No. 4, pp 1-10, 2012. [pdf]