Abstract
The well-known “cocktail party effect” refers to incidental detection of salient words, such as one's own-name, in supposedly unattended speech. However, empirical investigation of the prevalence of this phenomenon and the underlying mechanisms has been limited to extremely artificial contexts and has yielded conflicting results. We introduce a novel empirical approach for revisiting this effect under highly ecological conditions, by immersing participants in a multisensory Virtual Café and using realistic stimuli and tasks. Participants (32 female, 18 male) listened to conversational speech from a character at their table, while a barista in the back of the café called out food orders. Unbeknownst to them, the barista sometimes called orders containing either their own-name or words that created semantic violations. We assessed the neurophysiological response-profile to these two probes in the task-irrelevant barista stream by measuring participants' brain activity (EEG), galvanic skin response and overt gaze-shifts.
SIGNIFICANCE STATEMENT We found distinct neural and physiological responses to participants' own-name and semantic violations, indicating their incidental semantic processing despite being task-irrelevant. Interestingly, these responses were covert in nature and gaze-patterns were not associated with word-detection responses. This study emphasizes the nonexclusive nature of attention in multimodal ecological environments and demonstrates the brain's capacity to extract linguistic information from additional sources outside the primary focus of attention.
Introduction
The cognitive and neural mechanisms that enable people to successfully focus on one speaker of interest in multitalker environments have been a matter of ongoing interest for decades (Cherry, 1953; Bronkhorst, 2015; Qian et al., 2018). A major point of theoretical debate has been the extent to which speech that is irrelevant to the listener, and which they supposedly try to ignore, is nonetheless encoded and processed. Sparking this debate are conflicting behavioral and neural results. Many studies report evidence for encoding task-irrelevant speech only at a sensory level (Näätänen et al., 1997; Escera et al., 2003; Paavilainen, 2013; Ding et al., 2018; Parmentier et al., 2019), in line with early selection models of attention (Broadbent, 1965). However, others show evidence that listeners can extract some linguistic and/or semantic attributes from task-irrelevant speech (Moray, 1959; Lachter et al., 2004; Bronkhorst, 2015; Röer et al., 2017a,b; Brodbeck et al., 2020; Har-shai Yahav and Zion Golumbic, 2021), which is more in line with late-selection (Deutsch and Deutsch, 1963) and attenuation models of attention (Treisman, 1964). These include priming and intrusion effects of irrelevant speech (Dupoux et al., 2003; Rivenez et al., 2008; Dai et al., 2022), neural responses to linguistic features (Olguin et al., 2018; Brodbeck et al., 2020; Har-shai Yahav and Zion Golumbic, 2021; Holtze et al., 2021), as well as explicit detection of semantically salient words (Van Petten and Luka, 2012; Röer et al., 2017a,b), including, famously, one's own-name (Moray, 1959; Wood and Cowan, 1995; Conway et al., 2001; Tamura et al., 2012; Tateuchi et al., 2012; Naveh-Benjamin et al., 2014; Röer and Cowan, 2021).
Part of the reason this matter is still highly debated may be the variable and artificial nature of the paradigms used to test semantic processing of irrelevant speech, and the reliance on indirect measures. For example, many experiments investigating the processing of irrelevant speech do so using arbitrary lists of words that lack contextual continuity and test for behavioral intrusions of task-irrelevant stimuli into target-detection or short-term memory tasks (Lewis, 1970; Bentin et al., 1995; Wood and Cowan, 1995; Conway et al., 2001; Tun et al., 2002; Naveh-Benjamin et al., 2014; Vanthornhout et al., 2019). Although these studies offer insights into some aspects of the competition for processing concurrent stimuli, they are far removed from the listening challenges and tasks experienced in real-life situations, limiting their ecological validity (Holleman et al., 2020).
The goal of the current study was to investigate this long-standing question under ecological conditions, using speech-stimuli and tasks that listeners actually encounter in real-life, within a simulated realistic environment. To this end, we immersed participants in a Virtual Café where they experienced conversing with a partner (“target speaker”) while also hearing speech from other characters in the café (Shavit-Cohen and Zion Golumbic, 2019). Specifically, we placed a task-irrelevant barista-character in the back of the café who called out sequences of orders (e.g., “salad for Sarah”). Critically, we manipulated the content of this task-irrelevant barista stream in two ways: on some occasions, the orders contained either (1) the participant's own-name, or (2) a semantic violation (e.g., “coffee for salad”). We sought to circumvent reliance on subjective recall of task-irrelevant words (Röer and Cowan, 2021) or indirect assessment of their interference with task performance (Ljungberg et al., 2014; Naveh-Benjamin et al., 2014), and to provide a more direct indication of semantic processing of task-irrelevant speech. To this end, we measured participants' neural activity, gaze-dynamics, and galvanic skin response (GSR) as they engaged in listening to their conversation partner in the Virtual Café, and assessed the neurophysiological response-profile to the own-name and semantic-violation probes, which were always task-irrelevant. This unique multidimensional setup allowed us to test whether there are covert or overt indications for semantic processing of task-irrelevant speech, to uncover their neural and physiological correlates, and to revisit the proposed special affinity for detecting one's own-name, in an ecologically relevant context where these stimuli may actually occur in real-life.
Materials and Methods
Participants
Fifty adults participated in this study (ages 19-34, median 23; 32 female, 18 male, 10 left-handed), all fluent in Hebrew, with self-reported good eyesight or vision corrected by contact lenses, normal hearing, and no history of neurologic disorders. The number of participants was determined a priori based on a power analysis of results from a previous EEG study using a similar paradigm, in which neural responses to hearing one's own-name were compared with responses to a control-name (Pinto et al., 2023). We used the smaller own-name effect reported in that study (d′ = 0.36; mean effect size = 1.17; SD = 1.5) to estimate the number of participants needed for the current study. A power analysis using the G*Power software indicated that a minimum of 22 participants is needed to replicate this effect with a power of 0.95. Since the current experiment contained an additional condition (semantic violation) and also used a selective attention paradigm, we set the target recruitment number at 50 participants, more than double the minimal sample required. Because of technical issues and excessive artifacts, some participants were excluded from the analyses of different metrics (5 excluded from EEG analysis, 7 from GSR analysis, and 11 from eye-gaze analysis).
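For transparency, the sample-size estimate described above can also be approximated in MATLAB; the following is a minimal sketch (assuming the Statistics and Machine Learning Toolbox), using the effect size from Pinto et al. (2023). The exact minimum depends on the test configuration (one- vs two-tailed, α) used in G*Power, so this sketch is illustrative rather than a reproduction of the reported calculation.

```matlab
% Illustrative a priori sample-size estimate for a within-subject comparison,
% based on the smaller own-name effect in Pinto et al. (2023):
% mean paired difference = 1.17, SD = 1.5 (Cohen's d ~ 0.78).
meanDiff = 1.17;          % mean of the paired differences
sdDiff   = 1.5;           % SD of the paired differences
pwr      = 0.95;          % desired statistical power
% One-sample t test on the paired differences (two-tailed, alpha = 0.05 by default)
nMin = sampsizepwr('t', [0 sdDiff], meanDiff, pwr);
fprintf('Estimated minimum sample size: %d participants\n', ceil(nMin));
```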
Seven of the remaining participants reported being previously diagnosed with ADD/ADHD; of these, two reported taking medication occasionally and one on a regular basis. These participants were not excluded from the experiment, to maintain a more representative sample of the general population, in which the prevalence of ADHD is ∼12%-15% (Cohen et al., 2013), similar to the proportion in the current sample. In Extended Data Figure 2-1, we show that results from the subset of individuals with ADHD fell within the distribution of results from the rest of the sample, indicating that their inclusion did not introduce additional variability to our dependent measures. This is consistent with results from several recent meta-analyses showing small or no systematic differences between individuals with and without ADHD on many laboratory-based neural measures of selective attention (Kaiser et al., 2020; Faraone et al., 2021), which is likely because of the vast heterogeneity of this disorder and its diagnosis. We believe that the generalizability of the current results is enhanced by including these individuals in the group results and by offering a transparent report regarding where they fall within the sample distribution.
Figure 2-1
Histograms showing the distribution of the dependent measures tested across participants. Color-coding distinguishes between participants who reported having been diagnosed with ADD/ADHD (red) and those without a diagnosis (gray). In all measures, individuals with ADHD fall well within the distribution of the general sample and do not skew systematically toward the edges of the distributions. This suggests that, in the current dataset, individuals with ADHD were not distinctly different from the rest of the sample, and their inclusion in the group-level data did not increase the overall variability or skew the results.
Signed informed consent was obtained from each participant before the experiment, in accordance with the guidelines of the Institutional Ethics Committee at Bar-Ilan University. Participants were paid for participation or received class credit.
Apparatus
Participants were seated in an acoustic-shielded room and viewed a 3D virtual-reality (VR) scene of a café, through a head-mounted device (Oculus Rift Development Kit 2). The virtual environment, which consisted of several avatars sitting at various locations within the café, was programmed and presented using the software Unity (Fig. 1A). Two of the avatars (the target speaker and the barista; see below) were speaking, with their speech-audio manipulated so it was perceived as originating from their location in virtual space (3D sound algorithm, Unity). The audio was presented through in-ear earphones (with foam inserts; Etymotic ER-1), and both the graphic display and 3D sound were adapted online in accordance with head position, to maintain a spatially coherent experience of the virtual space. The articulation movements of the speaking avatars were synchronized to the envelope of the speech, to create a realistic audiovisual experience. Avatar eye movements were generated to appear natural. Additional animations of avatar body movements were selected from the Mixamo platform (Adobe Systems), adapted to the scene and randomly looped to avoid noticeable exact repetitions.
A, Illustration of the Virtual Café scenario. Top, A wide-view perspective of the café, as seen by the participants. The central area marked in red is the default visual field, whereas seeing the areas to the left and right requires head movements. Bottom, Demarcation of the ROIs used for analyzing eye-gaze patterns (areas unmarked in the figure were considered to be the “Floor”). Yellow represents ROIs that contain moving avatars. Circular inset, Participant wearing the virtual-reality headset over the EEG cap. B, Example of the narrative speech spoken by the target speaker. Each trial was followed by a word recognition test, in which participants were asked to identify words that had been present in the target-speech, out of a 9-word matrix. C, Illustration of the structure of the barista speech. This stimulus consisted of order-sentences, which contained probe-words (own-name and semantic violation) and their respective controls.
The VR device was custom-fitted with an embedded eye-tracker (SMI; 60 Hz monocular sampling rate) for continuous monitoring of participants' eye-gaze position. Gaze location was projected onto the virtual environment and was recorded as coordinates in 3D space, as well as in terms of which object in the virtual environment participants looked at (e.g., target-avatar, different avatars, ceiling etc.).
Neural activity was recorded using a 64-electrode EEG system (BioSemi; sampling rate: 1024 Hz) and a standard EEG cap. Two external electrodes were placed on the mastoids and served as reference channels. Electro-oculographic signals were simultaneously measured by three additional electrodes, located above the right eye and on the external side of each eye. Figure 1A (circular inset) shows the placement of the VR head device over the EEG cap. GSR was also measured, using two passive Nihon Kohden electrodes placed on the fingertips of the index and middle fingers of participants' nondominant hand. The signal was recorded through the BioSemi system amplifier and was synchronized to the sampling rate of the EEG.
Stimuli
The virtual scene included the following avatar characters: a character sitting at a table facing the participant (target speaker), a barista standing at the bar in the back of the café, slightly to the right, and four additional characters sitting at tables to the left and right of the target speaker. The latter characters did not speak; however, their body motions were slightly animated to preserve a realistic feel to the environment. The target speaker and the barista were both presented as speaking.
The target speaker's speech (Fig. 1B) consisted of segments of natural Hebrew narratives prerecorded by a female actor. To encourage spontaneous conversational-style speech, the actor was given a series of prompts and was asked to speak about them for ∼40 s (timer shown on screen). The prompts referred primarily to daily experiences and could be of either a personal nature (e.g., “describe a happy childhood memory”) or more informational (e.g., “describe how to prepare your favorite meal”). The actor was given the list of prompts in advance to plan their answers; however, the narratives themselves were delivered in a spontaneous, unscripted fashion. The amplitude of all narrative recordings was equated to the same sound level (SPL) using the software PRAAT (https://www.fon.hum.uva.nl/praat/). In total, we used 32 narrative recordings, ranging between 43 and 49 s in length. Each narrative was presented only once during the experiment, and the first two narratives were used only as training.
The barista's speech (Fig. 1C) consisted of lists of orders and filler sentences that might be heard in a café. The stimuli were constructed from single words, recorded individually by a female actor in no systematic order. Recordings of individual words were cut and equated to the same sound level as the target speech using PRAAT, and then concatenated (using MATLAB) to create order-sentences consisting of a person's name and a food item they had ordered. Order-sentences had two possible syntactic structures, which in English correspond to “Salad for Sarah” or “Sarah's salad is ready.” Importantly, in Hebrew, in both of these structures, the name appears in the third word position of the sentence (“Salat bishvil Sarah” vs “hasalat shel Sarah mukhan”), and hence its occurrence is similarly predictable. In addition to order-sentences, additional filler sentences were constructed that did not contain names and were contextually relevant for a café environment (e.g., “Clean-up needed at table five,” “Lunch served until four”). In all sentences, words were presented with a constant between-word interval of 200 ms. Individual sentences were further concatenated into streams, with a 400 ms between-sentence interval, to create streams of 37-41 s in length (each containing 13 sentences).
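For illustration, this type of concatenation can be sketched in MATLAB as follows (the file names, sampling rate, and variable names are purely illustrative, not the actual stimulus-generation code):

```matlab
% Illustrative concatenation of single-word recordings into order-sentences
% and streams, with 200 ms between-word and 400 ms between-sentence silences.
fs = 44100;                                   % assumed sampling rate of the recordings
wordGap     = zeros(round(0.200*fs), 1);      % 200 ms between-word interval
sentenceGap = zeros(round(0.400*fs), 1);      % 400 ms between-sentence interval

% Hypothetical word recordings making up two order-sentences
sentences = { {'salad.wav','for.wav','sarah.wav'}, ...
              {'coffee.wav','for.wav','daniel.wav'} };

stream = [];
for s = 1:numel(sentences)
    sentAudio = [];
    for w = 1:numel(sentences{s})
        y = audioread(sentences{s}{w});              % load one word recording
        sentAudio = [sentAudio; y(:,1); wordGap];    % append the word and a 200 ms gap
    end
    stream = [stream; sentAudio; sentenceGap];       % append the sentence and a 400 ms gap
end
audiowrite('barista_stream.wav', stream, fs);        % save the concatenated stream
```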
Some items in the barista stream were manipulated, in accordance with our two research questions, serving as probe-words (Fig. 1C): (1) Own-name: some order-sentences contained the participant's own-name as the subject of the sentence. Own-names were presented in accordance with participants' preferred pronunciation, based on verification during a prescreening phone call. Two potential control-names were chosen for each participant, which were the names of the two previous participants. These potential control-names occurred with the same probability as the own-name. In post-experiment debriefing, participants were asked about their familiarity with the two potential control-names, and the less familiar control-name was chosen for subsequent analyses. (2) Semantic violation: in some order-sentences, a food item was presented instead of the expected name (e.g., “Coffee for salad”), forming a semantically incongruent sentence. Eighteen different food-item words were used as semantic-violation probe-words. As a control, we chose 18 names that were matched to the food items in their duration, loudness level, number of open and closed syllables, and stress position. Semantic-violation sentences were compared with regular order-sentences containing these control-names, making the two sentence types identical in syntactic structure and semantic expectation up to the probe word.
For each participant, 30 personalized barista speech streams were constructed, to match the number of target-speech trials. As mentioned above, each barista stream comprised 13 concatenated sentences: two contained an own-name probe, two contained semantic violations, six contained the various control probes (control-names and semantic controls), and three were filler sentences. Thus, in total, each participant was presented with 60 repetitions of each probe: own-name, semantic violation, and their respective controls. Order-sentences were concatenated in pseudo-random order to ensure that sentences with own-name or semantic-violation probes were separated by at least two control sentences and were never the first or last in the sequence.
The target-speech was presented through in-ear earphones, at a comfortable loudness level that was determined during the training trials. The precise intensity level cannot be measured reliably, since it is affected by the specific positioning of the foam earphone inserts in the participants' ears, but was roughly between 60 and 70 dB SPL. The barista speech was always presented at half the amplitude of the target-speech (−6 dB relative to the target), and was additionally attenuated by a factor of 1/(5.5)^2 ≈ 0.033 to account for the spatial distance between them in the Virtual Café (the barista was located 5.5 m behind the target speaker; Unity Audio Spatializer SDK).
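For illustration, the combined scaling applied to the barista stream can be written out as follows (a sketch under the assumption that both factors are simple amplitude gains applied to the waveform; in practice, the distance attenuation was handled by the Unity Audio Spatializer):

```matlab
% Illustrative calculation of the barista stream's amplitude scaling
levelGain    = 10^(-6/20);       % -6 dB relative to the target speech (= 0.5 amplitude)
distanceGain = 1/(5.5)^2;        % ~0.033, attenuation for the 5.5 m virtual distance
baristaGain  = levelGain * distanceGain;
fprintf('Total barista gain: %.4f (%.1f dB relative to the target)\n', ...
        baristaGain, 20*log10(baristaGain));
```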
Procedure
Participants were instructed to listen to the content of the target speaker's narrative and performed a word-recognition task following each trial. They were told that they might hear additional stimuli in the café, but that these stimuli were irrelevant to their task. The word-recognition task following each trial consisted of the presentation of a 9-word matrix, from which participants were asked to select all the words that they remembered hearing in the target-speech (Fig. 1B). The number of correct words varied between 4 and 5, and participants received feedback on their responses after each trial (see calculation below). Proceeding to the next trial was controlled by the participants, and they were encouraged to take short breaks at regular intervals. Before the start of the experiment, participants were familiarized, in two training trials, with the Virtual Café scene, the task-relevant speaker, and the task. The training trials were also used to adjust the speech loudness to a comfortable level.
After completing the main task, participants were debriefed about their experience, and whether they noticed any stimuli that were out of the ordinary in the barista stream. Participants were also interviewed about their personal familiarity with people who share their own-name or one of the two potential control-names (to select the control-name that was least familiar).
Since previous studies have suggested that individual differences in the ability to notice one's own name (Conway et al., 2001, 2005; Colflesh and Conway, 2007) or to be distracted by other information in irrelevant speech (Hughes, 2014; Sörqvist and Rönnberg, 2014; Lambez et al., 2020) may be mediated by working memory capacity (WMC), we added a WMC screening test after the main experiment. Although WMC can be operationalized in many ways, we chose the primary measure used in comparable studies of attention, the Operation Span working memory task (OSPAN), using a shortened version of the task (Beaman, 2004; Sörqvist, 2010; Foster et al., 2015; Röer et al., 2017a,b). We used a Hebrew adaptation of this task (https://englelab.gatech.edu/translatedtasks.html#hebrew), in which participants are asked to solve simple arithmetic problems while also remembering a series of Hebrew letters. Each trial consisted of several (3-7) arithmetic problems, each followed by a single letter of the Hebrew alphabet. At the end of each trial, the participants were asked to recall the series of letters in the order they were presented. The test produces two main scores, both referring to the number of letters recalled and positioned correctly within the sequence: (1) counting only trials in which all letters were correctly recalled (absolute OSPAN score), and (2) counting all trials (partial OSPAN score). We used only the partial scores, as recommended by previous studies, since they make use of all available information (Friedman and Miyake, 2005; Redick et al., 2012; Đokić et al., 2018). Four participants were removed from this analysis because of low performance (<85% correct on the arithmetic problems; n = 3) or because of technical issues (n = 1).
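To make the distinction between the two OSPAN scores concrete, the following is a minimal sketch with made-up example data (variable names and values are purely illustrative):

```matlab
% Illustrative OSPAN scoring. For each trial: how many letters were presented
% (set size) and how many were recalled in the correct serial position.
setSize        = [3 4 5 6 7];     % letters presented per trial (example data)
correctInOrder = [3 4 3 6 5];     % letters recalled in the correct position

partialScore  = sum(correctInOrder);                       % credit from every trial
absoluteScore = sum(setSize(correctInOrder == setSize));   % credit only from perfectly recalled trials
fprintf('Partial OSPAN: %d, absolute OSPAN: %d\n', partialScore, absoluteScore);
```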
Behavioral data analysis
The word recognition task consisted of indicating which of the 9 words presented in the matrix had been spoken in the target speaker's narrative (Fig. 1B). If participants recognized all the words correctly and made no false alarms, they received a score of 9/9 = 100%. This score was reduced by 1/9 (11%) for each correct word that was missed or incorrect word that was marked. Hence, this behavioral score reflects sensitivity to recognizing the words from the narrative, combining both hits and correct rejections.
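As a minimal sketch of this scoring scheme (with a hypothetical response pattern; variable names are ours):

```matlab
% Illustrative word-recognition scoring. Each of the 9 words in the matrix is
% either correctly classified (present word marked, absent word unmarked) or not;
% each miss or false alarm costs 1/9 of the score.
isPresent = logical([1 1 1 1 0 0 0 0 0]);   % which matrix words appeared in the narrative
isMarked  = logical([1 1 1 0 0 1 0 0 0]);   % hypothetical participant response

nErrors = sum(isMarked ~= isPresent);       % misses + false alarms (here: 1 + 1 = 2)
score   = (9 - nErrors) / 9 * 100;          % 7/9 ~ 77.8% for this example
fprintf('Recognition score: %.1f%%\n', score);
```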
EEG analysis
EEG preprocessing and analysis were performed using the MATLAB-based FieldTrip toolbox (Oostenveld et al., 2011) as well as custom-written scripts. The raw data were filtered using a bandpass filter of 0.5-40 Hz, detrended, and de-meaned. Then the data were inspected visually and gross artifacts exceeding an absolute change of ±50 μV relative to the de-meaned signal (that were not eye movements) were removed. Independent component analysis was performed to identify and remove components associated with horizontal or vertical eye movements as well as heartbeats. Noisy electrodes that exhibited consistent high-frequency activity (>40 Hz) or excessive low-frequency DC drifts, likely because of bad or loose connectivity, were replaced with the weighted average of their neighbors using an interpolation procedure (either on the entire data set or on a per-trial basis, as needed). Five participants were excluded from the EEG analysis because of extreme artifacts, and data are reported from the remaining n = 45.
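For illustration, the core of such a preprocessing pipeline can be sketched in FieldTrip as follows (the file name, reference-channel labels, and component indices are illustrative placeholders, not the authors' actual values):

```matlab
% Illustrative FieldTrip preprocessing sketch: bandpass filter, detrend/demean,
% mastoid reference, and ICA-based removal of ocular/cardiac components.
cfg = [];
cfg.dataset    = 'subject01.bdf';      % hypothetical BioSemi recording
cfg.bpfilter   = 'yes';
cfg.bpfreq     = [0.5 40];             % 0.5-40 Hz bandpass
cfg.detrend    = 'yes';
cfg.demean     = 'yes';
cfg.reref      = 'yes';
cfg.refchannel = {'M1', 'M2'};         % mastoid reference (channel labels assumed)
data = ft_preprocessing(cfg);

% ICA decomposition to identify eye-movement and heartbeat components
cfg        = [];
cfg.method = 'runica';
comp = ft_componentanalysis(cfg, data);

% Remove components identified by visual inspection (indices are illustrative)
cfg           = [];
cfg.component = [1 3];
data_clean = ft_rejectcomponent(cfg, comp, data);
```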
The main goal of this study was to investigate the neural response to words in the task-irrelevant barista stream, and particularly to test whether the two types of probe words (hearing one's own-name or the presence of semantic violations) elicit unique neural responses, which would be indicative of their implicit detection. Therefore, our EEG analysis focused on the time period immediately after the presentation of each probe word and their respective controls. The clean EEG data were segmented into epochs between −200 and 1500 ms around each probe word. We applied a 12 Hz lowpass filter to further smooth the single-trial data before the event-related potential (ERP)–Residue Iteration Decomposition (RIDE) analysis (see below), as we found that this analysis produces more interpretable results when the single trials are more similar in spectral makeup to the resulting ERP. Epoched data were baseline corrected relative to the mean activity in the 200 ms preceding each probe (−200 to 0 ms).
Since the stimuli used as probe words were of varied lengths (range, 500-750 ms) and had different temporal profiles from each other, the classic method for deriving ERPs through simple averaging was not appropriate. Instead, we applied RIDE analysis (http://cns.hkbu.edu.hk/RIDE.htm) (Ouyang et al., 2011a,b), which accounts for latency differences across trials when extracting the average neural response. Specifically, the RIDE analysis decomposes the data into stimulus-locked (S) and general (C) components, and analyzes the trial-to-trial variability of these components to reconstruct an ERP. The time-windows defined for the S and C components were 0-400 ms and 300-1200 ms, respectively, values chosen based on visual inspection of the ERP grand-average (before RIDE estimation). In Extended Data Figures 3-1 and 4-1, we report additional RIDE analyses conducted using slightly different time-window selections; the results are qualitatively similar. This analysis yielded a RIDE-ERP time course for each participant. These were then averaged across participants to derive a grand-average RIDE-ERP at each electrode. Statistical comparison of the responses to each probe-type (own-name and semantic violations) versus their controls was conducted by applying a t test at each time point between 200 and 1200 ms and correcting for multiple comparisons (cluster permutation test with threshold-free cluster enhancement [TFCE] correction, implemented in the FieldTrip toolbox).
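For illustration, the group-level probe-versus-control comparison can be sketched with FieldTrip's permutation framework as follows. This is a minimal sketch, assuming a recent FieldTrip version in which TFCE is available as a correction method (cfg.correctm = 'tfce'); the structure names are illustrative, and a single electrode is used here for simplicity, whereas the reported analysis covered all electrodes.

```matlab
% Illustrative cluster-permutation comparison of probe vs control RIDE-ERPs.
% erpProbe and erpControl are cell arrays of per-participant timelock structures.
nSub = numel(erpProbe);

cfg = [];
cfg.channel          = 'CPz';                     % single electrode, for illustration only
cfg.latency          = [0.2 1.2];                 % 200-1200 ms analysis window
cfg.method           = 'montecarlo';
cfg.statistic        = 'ft_statfun_depsamplesT';  % paired (within-subject) t statistic
cfg.correctm         = 'tfce';                    % threshold-free cluster enhancement
cfg.numrandomization = 1000;
cfg.design = [1:nSub 1:nSub; ones(1,nSub) 2*ones(1,nSub)];
cfg.uvar   = 1;                                   % row coding the participant
cfg.ivar   = 2;                                   % row coding the condition (probe vs control)

stat = ft_timelockstatistics(cfg, erpProbe{:}, erpControl{:});
```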
Figure 3-1
Top figure: Classic grand-average ERP to one's own-name versus its control, from electrode CPz. Bottom figures: Grand-average RIDE-ERPs to one's own-name versus its control, from electrode CPz, derived using different parameter choices (Ouyang et al., 2011a) (http://cns.hkbu.edu.hk/RIDE.htm). Left column: RIDE models that included a single C component. Right column: RIDE models that included two C components (C1 and C2). The time-windows defined for each component are stated above each figure. Shaded areas around each waveform indicate the standard error of the mean. Shaded vertical areas indicate the time periods in which significant differences were found between the two responses (p < 0.05, TFCE-corrected). The modulation of a negative peak around 400 ms in response to hearing one's own-name was significant in all iterations of this analysis, supporting its robustness. Data from the top-left panel are reported in the main paper.
GSR analysis
Analysis of the GSR signal was performed using the MATLAB-based Ledalab toolbox (Benedek and Kaernbach, 2010), as well as custom-written scripts. The raw data were manually inspected for distinguishable artifacts, which were fixed using spline interpolation. Then a continuous decomposition analysis (CDA) was performed on the full GSR signal, and the time-locked response to individual probe words was extracted in windows between 0 and 5 s from the onset of each probe word. GSR responses to each probe word were baseline-corrected, relative to the mean response in the 1 s before each probe. The event-related GSR responses were averaged across trials, and a group-level grand-average was derived. Statistical analysis of the difference in responses to each probe-word versus its control was conducted by applying a paired t test to the mean amplitude within a 2 s time window surrounding the peak. Seven participants were excluded from the GSR analysis: five because of extreme artifacts and two because of technical issues. GSR results are reported for the remaining n = 43 participants.
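The event-related portion of this analysis can be sketched as follows (the continuous decomposition itself was run in Ledalab; the epoching, baseline correction, and averaging below are a minimal sketch with illustrative variable names):

```matlab
% Illustrative event-related GSR analysis on the Ledalab phasic-driver output.
% gsr: phasic driver (1 x nSamples); fs: its sampling rate;
% onsets: probe-word onset samples (all illustrative variable names).
win     = round([0 5]  * fs);      % 0-5 s window after each probe word
baseWin = round([-1 0] * fs);      % 1 s pre-probe baseline
statWin = round([2 4]  * fs);      % 2-4 s window used for statistics

nTrials = numel(onsets);
resp = nan(nTrials, win(2) - win(1) + 1);
for t = 1:nTrials
    seg  = gsr(onsets(t) + (win(1):win(2)));
    base = mean(gsr(onsets(t) + (baseWin(1):baseWin(2))));
    resp(t,:) = seg - base;                         % baseline-corrected single-trial response
end
meanResp = mean(resp, 1);                           % trial average for one participant
probeAmp = mean(meanResp(statWin(1)+1 : statWin(2)+1));   % mean amplitude in the 2-4 s window

% Group level (one value per participant and condition):
% [~, p, ~, stats] = ttest(probeAmps, controlAmps, 'Tail', 'right');
```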
Eye-gaze analysis
Eye-gaze raw data were preprocessed to remove data surrounding blinks. Although no gaze-data are recorded during a blink, residual gaze-data immediately before or after a blink may have been picked up by the tracker and erroneously inserted into the data. Since blinks were not detected automatically by the eye-tracker, we used the electro-oculographic channel from the EEG recordings to detect the occurrence of blinks (starting with a threshold of 70 μV and adjusting it for each participant, based on visual inspection). Gaze data that fell between 100 ms before and 200 ms after the blink were removed, in accordance with suggested procedures for eye-tracking in VR (Anderson et al., 2023).
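This cleaning step can be sketched as follows (a minimal sketch; blink times are assumed to come from the EOG-based detection described above, and all variable names are illustrative):

```matlab
% Illustrative removal of blink-contaminated gaze samples.
% blinkOnsets / blinkOffsets: blink times in seconds (from the EOG channel);
% gazeTime: eye-tracker time stamps (60 Hz); gazeXYZ: gaze samples.
preBlink  = 0.100;    % remove data from 100 ms before each blink...
postBlink = 0.200;    % ...until 200 ms after it (Anderson et al., 2023)

isBad = false(size(gazeTime));
for b = 1:numel(blinkOnsets)
    isBad = isBad | (gazeTime >= blinkOnsets(b)  - preBlink & ...
                     gazeTime <= blinkOffsets(b) + postBlink);
end
gazeXYZ(isBad, :) = NaN;    % mark contaminated samples as missing
```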
The 3D café scene was parsed into several visual ROIs, as shown in Figure 1A. These included the areas around the characters in the café (target speaker, left-pair, right-pair, barista) as well as regions of the café that did not contain characters (target speaker table, ceiling, floor, left restaurant, right restaurant, back of restaurant). Information regarding participants' momentary head position and the direction of their gaze vector was combined to determine which ROI participants were looking at, on a moment-to-moment basis. Analysis of eye-gaze patterns focused on how long participants remained fixated within a given ROI and on the detection of saccades between ROIs (defined as gaze-shifts between fixations in two different ROIs, each lasting at least 80 ms) (Anderson et al., 2023). Saccades within ROIs were not analyzed, since these were outside the scope of our research interest.
To assess whether name and semantic probes elicited overt gaze-shifts, we quantified the probability of performing a saccade away from the target speaker ROI, either to any other ROI or specifically toward the Barista ROI, in 2-s-long epochs following the onset of probe words. We compared this with the probability of saccades in periods following the respective controls, which was evaluated statistically using paired t tests. We also performed exploratory correlational analyses to test whether individual differences in the frequency of spontaneous gaze-shifts away from the target speaker were related to the intersubject variability in any of the other dependent measures tested here (Pearson's correlations with accuracy on the recognition test, WMC, and the differences in the neural or GSR responses to the probes). Eleven participants were excluded from analysis of eye-gaze data because of technical difficulties with the eye-tracker, and data are reported for the remaining n = 39.
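For illustration, the probe-locked gaze-shift measure can be sketched as follows (a minimal sketch with illustrative variable names; saccade and word-onset times are assumed to be on a common clock):

```matlab
% Illustrative probe-locked gaze-shift analysis for one participant.
% saccadeTimes: onsets (s) of detected saccades away from the target-speaker ROI;
% probeOnsets / controlOnsets: onsets (s) of probe and control words.
epochLen = 2;    % 2 s post-word window

hasShift = @(onsets) arrayfun(@(t) any(saccadeTimes > t & ...
                                       saccadeTimes <= t + epochLen), onsets);
pProbe   = mean(hasShift(probeOnsets));     % proportion of probe epochs with a gaze-shift
pControl = mean(hasShift(controlOnsets));   % proportion of control epochs with a gaze-shift

% Group level (one pProbe / pControl value per participant):
% [~, p, ~, stats] = ttest(pProbeAll, pControlAll);   % paired t test
```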
Results
Behavioral task
Word-recognition performance was relatively good, with an average accuracy of 85.27 ± 3.42% (mean ± SD; Fig. 2A), indicating that, for the most part, participants paid attention to the target speaker and correctly recognized words from the narrative. This measure of accuracy is affected both by hits and by correct rejections, reflecting sensitivity. Performance on the task was not correlated with WMC (OSPAN partial; n = 46, r = 0.11, p = 0.49; Fig. 2B).
A, Box plot represents the mean behavioral performance on the word recognition task (for all the participants before any exclusions, n = 50). Error bar indicates SEM. B, Scatter plot of accuracy on the word recognition task versus WMC (OSPAN score), showing no significant correlation between them. Gray shaded area represents a 95% CI. Four participants were excluded from this analysis because of lack of a reliable OSPAN score (n = 46). Distribution of individual-level data, and separation between participants with an ADHD diagnosis are shown in Extended Data Figure 2-1.
Post-experiment debriefing
After completing the experiment, participants were interviewed about their experience. In response to the question “how was the experiment?,” 94% voluntarily mentioned that they had heard their name in the barista stream, and 4% mentioned hearing sentences that did not make sense. Then, when asked directly “did you hear your name?,” all but one participant recalled that they may have heard their own-name or a similar name. When asked directly “did everything you heard make sense?,” an additional 4% of participants responded that they had noticed some portions of the speech that did not make sense, and an additional 10% reported that they felt something was not right but could not put their finger on it, as they “didn't really listen to the background.”
EEG results
We estimated the RIDE-ERPs in response to each type of probe-word and their respective controls (5 participants were excluded from this analysis; hence, data are reported for n = 45). Statistical analysis was performed on the RIDE-ERP grand-average across participants, an analysis that accounts for latency variability across trials. However, we also present the traditional ERP grand-average (shown in Figs. 3, 4, insets) for qualitative comparison of the resulting waveforms.
The neural response to own- and control-name probes (n = 45). A, RIDE-ERP grand-average, shown at electrode CPz (location indicated in white on the topographies in B, left). Shaded vertical areas represent the time periods in which significant differences were found between the response to one's own-name versus the control-name (250-460 ms; p < 0.05, TFCE-corrected). Shaded areas around waveforms represent SEM. Inset, Waveform of the classic ERP grand-average, before the RIDE analysis. B, Scalp topographies of the responses to the own- and control-name probes (left), and the difference between them (right), in the significant time window. Right, The electrodes highlighted in white showed a significant difference. For RIDE-ERPs derived using different parameters, see Extended Data Figure 3-1. Distribution of individual-level data, and separation between participants with an ADHD diagnosis, are shown in Extended Data Figure 2-1.
The neural response to semantic violations versus control probes (n = 45). A, RIDE-ERP grand-average, shown at electrode Fz (location indicated in white on the topographies in B, C, left). Shaded vertical areas represent the time periods in which significant differences were found between the response to semantic violations versus their controls (p < 0.05, TFCE-corrected). Shaded areas around waveforms represent SEM. Inset, Waveform of the classic ERP grand-average, before the RIDE analysis. B, C, Scalp topographies of the responses to the semantic violation and control probes (left and middle), and the difference between them (right), in the early (B) and late (C) time windows where significant differences were observed. Right panels, The electrodes highlighted in white showed a significant difference. For RIDE-ERPs derived using different parameters, see Extended Data Figure 4-1. Distribution of individual-level data, and separation between participants with an ADHD diagnosis, are shown in Extended Data Figure 2-1.
Figure 4-1
Top figure: Classic grand-average ERP to semantic violations versus their controls, from electrode Fz. Bottom figures: Grand-average RIDE-ERPs to semantic violations versus their controls, from electrode Fz, derived using different parameter choices (Ouyang et al., 2011a) (http://cns.hkbu.edu.hk/RIDE.htm). Left column: RIDE models that included a single C component. Right column: RIDE models that included two C components (C1 and C2). The time-windows defined for each component are stated above each figure. Shaded areas around each waveform indicate the standard error of the mean. Shaded vertical areas indicate the time periods in which significant differences were found between the two responses (p < 0.05, TFCE-corrected). The modulation of a positive peak between 200 and 400 ms in response to semantic violations was more susceptible to parameter choices in the RIDE analysis and was not significant in all iterations of this analysis (although the trend is observed for most parameter choices). This may be due to the more variable nature of the stimuli averaged here (which were not simple repetitions of the same word, as in the own-name case), and we would recommend replication attempts of these results in future research. Data from the top-left panel are reported in the main paper.
Detection of own-name
Figure 3 shows the comparison between neural responses to hearing one's own-name versus their personal control-name in the barista stream. Two prominent peaks are observed in the RIDE-ERP grand-average, a negative peak between 300 and 550 ms, and a later positive peak between 800 and 1200 ms, both of which were maximal at centro-parietal electrodes. Statistical analysis showed that the mid-latency negative peak was significantly larger in response to one's own-name versus its control, reaching significance primarily between 250 and 460 ms (significant electrodes marked in Fig. 3B; p < 0.025, two-tailed TFCE cluster correction; for replication with other RIDE parameters, see Extended Data Fig. 3-1).
Detection of semantic violation
Figure 4 shows the comparison between neural responses to words that create a semantic violation versus well-formed sentences in the barista stream. The RIDE-ERP grand-average to these words showed a series of three peaks: a positive peak between 250 and 450 ms, a negative mid-latency peak between 600 and 800 ms, and a later positive peak between 900 and 1200 ms. Statistical analysis indicated that both the early and the late positive responses were significantly larger in response to semantic violations versus their controls (significant time windows: 200-420 ms and 940-1160 ms; p < 0.025, two-tailed TFCE cluster correction; significant electrodes marked in Fig. 4B; for replication with other RIDE parameters, see Extended Data Fig. 4-1).
GSR results
Figure 5 shows the comparisons between the grand-averaged GSR responses to hearing one's own-name and to semantic violations versus their respective controls. Both waveforms show a wide positive peak in response to the probe-words, but not to their controls. Statistical comparison of the mean response in a 2-4 s time window following each probe showed significantly higher GSR responses to hearing one's own-name versus a control-name (t(42) = 2.104, p = 0.02, one-tailed paired t test; Cohen's d = 0.32, indicating a moderate effect size), as well as a higher response to semantic violations versus their controls (t(42) = 1.841, p = 0.04, one-tailed paired t test; Cohen's d = 0.28, indicating a moderate effect size).
GSR grand-averaged responses to (A) hearing one's own-name versus control-name, and (B) hearing the semantic violation probes versus control words (n = 43). Shaded areas around both waveforms represent SEM. Distribution of individual-level data, and separation between participants with an ADHD diagnosis are shown in Extended Data Figure 2-1.
Eye tracking results
Analysis of spontaneous gaze-shifts performed during the experiment revealed that participants primarily looked at the target speaker (mean ± SD: 94.8 ± 7.5% of each trial). When not looking at the target speaker, the next most popular places to look at were (Fig. 6A, right): the ceiling, the target speaker's table, the pairs of characters to the left and to the right, and the floor. Interestingly, the number of spontaneous gaze-shifts varied substantially across individuals, following a hazard-like distribution, with some participants performing frequent saccades away from the target speaker, whereas others remained focused on the target speaker with nearly no saccades away (range across participants: 0-40 gaze-shifts/trial; mean = 3.47 gaze-shifts/trial, SD = 4.71; Fig. 6B). However, we did not find any significant correlations between the frequency of gaze-shifts and the other dependent measures tested here (Pearson's correlations: OSPAN-WM score: r = 0.15, p = 0.39; behavior: r = −0.21, p = 0.2; GSR name effect: r = −0.24, p = 0.15; GSR semantic effect: r = 0.03, p = 0.86; ERP name effect: r = −0.2, p = 0.22; ERP semantic effect: r = 0.08, p = 0.61). Additional research is required to understand the factors contributing to these individual differences.
Eye-tracking results. A, Bar graph represents the percent of time per trial that participants focused their gaze toward the target speaker versus other parts of the café. Pie chart represents the distribution of eye gaze to areas of the café other than the target speaker (using predefined ROIs). Distribution of individual-level data, and separation between participants with an ADHD diagnosis, are shown in Extended Data Figure 2-1. B, Average number of gaze-shifts away from the target speaker per trial, demonstrating the large variability across participants. Horizontal line indicates the mean across participants. C, Bar graphs represent the percent of epochs following probe-words and their controls that contained at least one gaze-shift away from the target speaker. Error bars indicate SEM. D, Same as in C, but only for gaze-shifts toward the barista character.
When testing whether the probability of performing a gaze-shift away from the target speaker increased after presentation of a probe word (within a 2 s time window), no significant effects were found. Gaze-shifts were detected after ∼5% of the probe words, but this did not differ between own-name/semantic-violation probes versus their respective controls (name probes: t(38) = 0.117, p = 0.46; semantic probes: t(38) = 0.97, p = 0.17; Fig. 6C). When limiting this analysis only to saccades from the target speaker to the barista, too few gaze-shifts were detected, precluding reliable statistical analysis (Fig. 6D).
Discussion
The degree to which background and task-irrelevant speech is processed has been a topic of heated debate, with conflicting results reported over the years (Lachter et al., 2004; Beaman et al., 2007; Paavilainen, 2013; Olguin et al., 2018; Röer and Cowan, 2021). We introduce a novel approach to studying this question under ecological circumstances that simulate real-life listening challenges. We found that both probes tested here, one's own-name and semantic violations embedded in the task-irrelevant barista speech, elicited robust neural and physiological responses, supporting the notion that background speech is not merely encoded acoustically, but is also processed for its content. Interestingly, this detection was largely covert in nature, and probe-words did not elicit overt oculomotor capture.
The behavioral and neural sensitivity to hearing one's name is a well-known phenomenon, thought to reflect a special status of self-representation, possibly related to an evolutionary survival mechanism (Müller and Kutas, 1996; Perrin et al., 2005; Röer et al., 2013). The own-name advantage has been demonstrated in a variety of paradigms, primarily when it is within the focus of attention (Berlad and Pratt, 1995; Folmer and Yingling, 1997; Perrin et al., 1999, 2005; Holeckova et al., 2006; Höller et al., 2011; Eichenlaub et al., 2012; Tateuchi et al., 2012; Del Giudice et al., 2014; Lechinger et al., 2016; Jijomon and Vinod, 2021; Pinto et al., 2023). However, studies testing whether the advantage for detecting one's name persists when actively paying attention elsewhere have yielded conflicting results, and this effect varies substantially across studies and methods (Moray, 1959; Wood and Cowan, 1995; Conway et al., 2001; Naveh-Benjamin et al., 2014). Moreover, in many studies, detection of one's name is assessed only in post-experiment debriefing, which requires high levels of conscious awareness and is susceptible to memory failures. Here we found robust physiological and neural responses to one's name in the barista stream, providing direct evidence for its incidental detection, even when it is task-irrelevant and supposedly outside the focus of attention (see also Holtze et al., 2021).
In addition to the singular example of detecting one's name, our study addresses semantic processing of task-irrelevant speech more generally. The robust physiological and neural responses to semantic violations in the barista speech imply that the entire speech is monitored for its semantic content (Deutsch and Deutsch, 1963; Parmentier et al., 2018). This is because detecting the semantic violation in order-sentences such as “coffee for soup” does not rely simply on detecting salient words (as may be the case for hearing one's name) but requires integrative processing, structure-building, and syntactic analysis, above and beyond lexical identification of each word. Hence, these findings support the perspective that structure-building processes can be applied to speech even if it is task-irrelevant, and perhaps automatically, as suggested by several recent studies (Olguin et al., 2018; Har-shai Yahav and Zion Golumbic, 2021; ten Oever et al., 2022).
Interestingly, when debriefing participants after the experiment, most of them (94%) recalled hearing their name at some point; however, few recalled noticing any semantic violations (∼8%). It may be that subjective recall for one's name is better because of its personal relevance or because of priming effects from its frequent repetition, in contrast to semantic violations that were not associated with specific words. Moreover, we cannot know whether these differences in subjective recall reflect different levels of conscious awareness to the different probes (Lin and Yeh, 2014; Chen and Wyble, 2015; Kim et al., 2018) or whether they are related to memory failures (Bäuml et al., 2005). However, the large discrepancy between subjective recall for semantic violations versus the neural and physiological evidence of their detection highlights the complexity of relying on subjective-recall measures and the importance of including more direct and objective measures (Bentin et al., 1995).
We also note that, although some have suggested that individual differences in the tendency to notice semantic properties of irrelevant speech may be related to WMC (Colflesh and Conway, 2007; Hughes, 2014; Sörqvist and Rönnberg, 2014), this finding is inconsistent across studies (Beaman et al., 2007; Forster and Lavie, 2007; Röer et al., 2017a) and is not supported in the current data.
Neural responses to own-name and semantic violations
Although both probes elicited clear neural responses, the RIDE-ERP components where each effect was observed differed. The neural response to semantic violations showed two heightened mid-central responses, between 200-400 ms and at ∼1000 ms, whereas the own-name response manifested in an enhanced negative response at ∼400 ms. Before offering potential interpretations for these results, and why they may differ for the two probe types, we highlight a methodological difference between the RIDE-ERPs to the two probes that renders them not directly comparable to each other. The own-name probe and its control were repeated verbatim within-participant but were unique to each participant. Conversely, semantic violations and their controls consisted of several different words, which were repeated across all participants. These differences are inherent to the experimental design; however, they result in different levels of intrasubject and intersubject variability when averaging across stimuli and participants, which can affect the amplitude and time course of RIDE-ERPs. For this reason, we cannot directly compare the RIDE-ERPs to the two probes, but we ensured that each probe had its own matched control derived in an identical manner. Bearing this in mind, we turn to discuss the specific effects found for each probe.
The time-scale, polarity, and scalp distribution of the own-name effect were consistent with the well-established N400 ERP component, a response associated with lexical and semantic processing (Kutas and Federmeier, 2011; Leckey and Federmeier, 2020). A heightened N400-like response to one's name has been reported previously (Müller and Kutas, 1996; Eichenlaub et al., 2012; Tamura et al., 2012), although some find a P300-like response, which is associated with target detection (Berlad and Pratt, 1995; Folmer and Yingling, 1997; Gray et al., 2004). An important ecological advantage of the current design is that, by embedding names within the barista stream, they were always fully contextual and did not create any incongruencies or artificial semantic disruptions (Holtze et al., 2021). The N400-like response observed here may reflect incidental detection of personally meaningful words, as part of the “natural” process of retrieving lexical information for incoming stimuli. Importantly, in another recent study in our laboratory, using a similar paradigm during a distributed attentional task, we also found modulation of an N400-like response to one's name (Pinto et al., 2023), strengthening the reliability of this effect and extending it to task-irrelevant speech as well.
For semantic violations embedded in the barista stream, we had also expected to find N400-like responses, as is often observed for semantically incongruent or unexpected words (Kutas and Federmeier, 2011; Leckey and Federmeier, 2020; Valderrama et al., 2021). However, here the N400 response did not differ for semantic violations versus their controls. Currently, we do not know whether this is because of methodological issues (e.g., variability across words), or whether this effect is truly not present for semantic violations in task-irrelevant speech (Bentin et al., 1995; Kallionpää et al., 2018). However, the N400 is not the only ERP component known to be affected by semantic incongruency. The two effects of semantic violations found here, a complex of positive- and negative-going deflections at mid-central electrodes between 200 and 400 ms, and a later mid-central positive response at ∼1000 ms, are largely consistent with the early N200/P250 complex and the late positive component (LPC), respectively.
Early N200/P250 responses are modulated by semantic priming (Hill et al., 2005) and semantic expectations and are considered the earliest neural marker of semantic processing (van den Brink et al., 2001; van den Brink and Hagoort, 2004; Heilbron et al., 2022). Interestingly, semantic congruency effects on these early responses are more prevalent for spoken versus written sentences (Hagoort and Brown, 2000; van den Brink et al., 2001; Toffolo et al., 2022). The LPC, which can have a relatively prolonged and variable latency (Leckey and Federmeier, 2020), is associated with more integrative and syntactic processing of sentences (Kaan et al., 2000; Steinhauer and Friederici, 2001; Friederici et al., 2002; Wassenaar and Hagoort, 2007; Vissers et al., 2008; Gouvea et al., 2010; Brouwer et al., 2012), and is modulated both by semantic and syntactic violations (Leckey and Federmeier, 2020). As mentioned above, in the current design, detection of semantic violations relies on integrative and structure-building processing, which is consistent with observing an LPC response and suggests that these processes may be applied automatically even when speech is outside the primary focus of attention (Olguin et al., 2018; Har-shai Yahav and Zion Golumbic, 2021; ten Oever et al., 2022). However, another interpretation is that one (or both) of the positive responses found here reflect a P300-like response, associated with capture of attention from salient background stimuli (Bell et al., 2010; Masson and Bidet-Caulet, 2019; Huang and Elhilali, 2020; Innes and Todd, 2022), since the LPC is sometimes also interpreted as a “delayed P300 response” (Hill et al., 2002; Hunter, 2016; Yang et al., 2019). Together, the current RIDE-ERP results provide clear and novel neural evidence for semantic processing of task-irrelevant speech under ecological conditions. However, they also leave many questions open as to the specific underlying cognitive/neural mechanisms, which require additional research and replications.
GSR responses
In addition to the neural responses, we observed a transient increase in physiological arousal, indexed by a GSR response evoked after both probe types. This response reflects the autonomic system's response to salient stimuli in the environment, associated with a “fight or flight” response (Mueller-Pfeiffer et al., 2014; Ho et al., 2020). Importantly, an increase in GSR is not only triggered by acoustic salience but is also observed for stimuli that are emotionally triggering (Bach et al., 2010) or personally relevant (Gronau et al., 2003), including one's name (Pinto et al., 2023). Hence, this response is considered by some as a “somatic marker,” reflecting the critical connection between the brain and the body in driving cognitive behavior and decision-making (Bechara et al., 2005; Christopoulos et al., 2019). The inclusion of GSR measurements here provided converging evidence for the neural data, supporting incidental detection of one's own-name and semantic violations in task-irrelevant speech, and indicating that it is accompanied by a transient increase in arousal.
Eye-gaze
Interestingly, hearing one's name or semantic violations did not elicit gaze-shifts toward the barista (or anywhere else in the café). This was counter to our expectation that implicit detection would capture exogenous attention overtly, as is often the case for salient visual events (Ludwig et al., 2008; Mulckhuyse et al., 2008; Nissens et al., 2017). This type of oculomotor capture has not been studied extensively with respect to auditory or speech stimuli, and currently we can only speculate as to why the implicit detection of own-name and semantic probes did not elicit gaze-shifts. One possibility is that, although these events were detected, this detection remained preconscious and did not lead to full exogenous attention capture. Another possibility is that individuals intentionally suppress oculomotor capture, perhaps because of the socially relevant context (Gaspelin and Luck, 2019; Adams and Gaspelin, 2021). Since eye movements are a critical component of real-life perception, studying the factors governing spontaneous gaze-patterns as well as oculomotor capture by background events is critical for furthering our understanding of real-life attention.
In conclusion, the combined neural, ocular, and physiological data collected in an immersive virtual-reality platform, where participants contend with a realistic multitalker environment, bring us closer to understanding how the brain deals with the perceptual complexity of real-life environments. Our results demonstrate the nonexclusive nature of real-life attention, whereby even when one stimulus is of primary interest to an individual, they also pick up words and form linguistic representations of other speech stimuli around them. This multiplexed perspective of real-life attention likely carries ecological benefits, allowing individuals to notice and react to important events around them, rather than approaching the world with “tunnel vision” in which all available processing resources are allocated to only one stimulus of interest.
Footnotes
This work was supported by Israel Science Foundation Grant 2339/20 to E.Z.-G. and Israel Ministry of Science Grant 88962 to E.Z.-G. We thank Dr. Maya Kaufman for consulting on stimulus design; and Orel Levi for assistance in data collection.
The authors declare no competing financial interests.
Correspondence should be addressed to Elana Zion-Golumbic at elana.zion-golumbic@biu.ac.il