Articles, Behavioral/Cognitive

Visual Input Enhances Selective Speech Envelope Tracking in Auditory Cortex at a “Cocktail Party”

Elana Zion Golumbic, Gregory B. Cogan, Charles E. Schroeder and David Poeppel
Journal of Neuroscience 23 January 2013, 33 (4) 1417-1426; https://doi.org/10.1523/JNEUROSCI.3675-12.2013
Author affiliations: 1Department of Psychiatry, Columbia University College of Physicians and Surgeons, New York, New York 10032 (E.Z.G., C.E.S.); 2Cognitive Neuroscience and Schizophrenia Program, Nathan S. Kline Institute for Psychiatric Research, Orangeburg, New York 10962 (E.Z.G., C.E.S.); 3Department of Psychology, New York University, New York, New York 10003 (G.B.C., D.P.)

Abstract

Our ability to selectively attend to one auditory signal amid competing input streams, epitomized by the “Cocktail Party” problem, continues to stimulate research from various approaches. How this demanding perceptual feat is achieved from a neural systems perspective remains unclear and controversial. It is well established that neural responses to attended stimuli are enhanced compared with responses to ignored ones, but responses to ignored stimuli are nonetheless highly significant, leading to interference in performance. We investigated whether congruent visual input of an attended speaker enhances cortical selectivity in auditory cortex, leading to diminished representation of ignored stimuli. We recorded magnetoencephalographic signals from human participants as they attended to segments of natural continuous speech. Using two complementary methods of quantifying the neural response to speech, we found that viewing a speaker's face enhances the capacity of auditory cortex to track the temporal speech envelope of that speaker. This mechanism was most effective in a Cocktail Party setting, promoting preferential tracking of the attended speaker, whereas without visual input no significant attentional modulation was observed.

These neurophysiological results underscore the importance of visual input in resolving perceptual ambiguity in a noisy environment. Since visual cues in speech precede the associated auditory signals, they likely serve a predictive role in facilitating auditory processing of speech, perhaps by directing attentional resources to appropriate points in time when to-be-attended acoustic input is expected to arrive.

Introduction

Understanding speech, particularly under noisy conditions, is significantly facilitated by viewing the speaker's face (Sumby and Pollack, 1954; Grant and Seitz, 2000; Schwartz et al., 2004). Multiple studies indicate that visual input affects neural responses to speech, both in early sensory cortices and higher order speech-related areas (Besle et al., 2004; Davis et al., 2008; McGettigan et al., 2012). However, little is known about the neural dynamics by which visual input facilitates on-line speech perception, since the majority of electrophysiological studies to date have focused on audiovisual (AV) effects of processing individual syllables, in rather unnatural laboratory paradigms (van Wassenhove et al., 2005).

Recently, important advances have been made in the ability to quantify the neural response to continuous speech. Converging evidence across several methodologies indicates that low-frequency neural activity in auditory cortex (<15 Hz) phase-locks to the temporal envelope of speech, which fluctuates at similar rates (Rosen, 1992; Luo and Poeppel, 2007; Aiken and Picton, 2008; Lalor and Foxe, 2010; Hertrich et al., 2012; Peelle et al., 2012). The temporal envelope of speech is critical for comprehension (Shannon et al., 1995; Drullman, 2006), and it has been suggested that this "envelope-tracking" response serves to parse the continuous input into smaller units (syllables or phrases) to which higher order decoding of the fine structure is applied, allowing for syllable classification and word recognition (Ghitza, 2011; Giraud and Poeppel, 2012).

To date, studies investigating the envelope-tracking response have mainly used auditory stimuli. However, articulatory facial movements are also correlated with the speech envelope and precede it by ∼150 ms (Grant and Seitz, 2000; Kim and Davis, 2003; Chandrasekaran et al., 2009). Thus, theoretically, viewing the talking face can provide predictive cues for upcoming speech and facilitate its processing (van Wassenhove et al., 2005; Schroeder et al. 2008; Arnal et al. 2011). In this paper we investigate whether viewing the speaker's face indeed enhances the envelope-tracking response in auditory cortex.

Since the benefit of congruent visual input to speech perception is greatest under difficult auditory conditions (Sumby and Pollack, 1954; Callan et al., 2003; Ross et al., 2007), we investigated AV effects on the envelope-tracking response under two conditions: when listening to a single speaker and when attending to one speaker while ignoring a concurrent irrelevant speaker, simulating a "Cocktail Party" (Cherry, 1953; McDermott, 2009). Previous studies, using auditory-only stimuli, have shown that the envelope-tracking response to the attended speaker is amplified compared with the ignored speaker (Kerlin et al., 2010; Ding and Simon, 2012; Mesgarani and Chang, 2012). Nonetheless, the response to the ignored speaker remains robust, which can lead to behavioral interference (Cherry, 1953; Moray, 1959; Wood and Cowan, 1995; Beaman et al., 2007).

Here we recorded magnetoencephalographic (MEG) signals as human participants attended to segments of natural speech, presented either with or without the corresponding talking face. We investigated whether viewing the talking face enhances the envelope-tracking response and whether visual input improves the preferential tracking of the attended speaker in a Cocktail Party environment (Zion Golumbic et al., 2012).

Materials and Methods

Participants

Thirteen native English-speaking participants (eight female, median age 22, one left handed) with normal hearing and no history of neurological disorders provided informed consent according to the University Committee on Activities Involving Human Participants at New York University. All participants but one were right handed as assessed by the Edinburgh Inventory of Handedness (Oldfield, 1971).

MEG recordings

MEG data were collected on a 157-channel whole-head MEG system (5 cm baseline axial gradiometer SQUID-based sensors; KIT) in an actively magnetically shielded room (Vakuumschmelze GmbH). Data were sampled at 1000 Hz, with a notch filter at 60 Hz and an on-line 200 Hz low-pass filter. Each participant's head position was assessed via five coils attached to anatomical landmarks both before and after the experiment to ensure that head movement was minimal. Head-shape data were digitized using a 3D digitizer (Polhemus). The auditory signals were presented through in-ear earphones (Etymotic ER3-A), and the speech sounds were presented at comfortable conversational levels (∼72 dB SPL). The visual materials were presented on a rear-projection screen in the shielded room (∼18° horizontal and 11° vertical visual angle, ∼44 cm from the eyes; InFocus LP 850 projector). Stimulus delivery and triggering were controlled by Presentation (Neurobehavioral Systems).

Experimental design

The stimuli consisted of eight movie clips of two speakers (one male, one female; four movies per speaker) reciting a short passage (9.8 ± 1.5 s). The movies were edited using QuickTime Pro (Apple) to align the faces in the center of the frame, equate the relative size of the male and female faces, and clip the movies appropriately. Each female movie was paired with a male movie of approximately similar length (<0.5 s difference in length), and this pairing remained constant throughout the entire study, yielding four stimulus pairs. In each trial, a stimulus pair was selected, and either one of the stimuli from the pair was presented individually (Single Speaker) or both stimuli were presented simultaneously (Cocktail Party). The stimuli were presented either with or without the video (AV/A), yielding a total of four conditions: AVsingle, AVcocktail, Asingle, Acocktail (see Fig. 1A). To ensure sufficient repetitions of each stimulus in a particular condition, the attribution of stimulus pair to condition was held constant across the experiment. Specifically, two stimulus pairs were used for the AV conditions and two for the A conditions, and in both cases the same stimulus pairs were used for the Single Speaker and Cocktail Party conditions. The envelopes of all stimulus pairs were uncorrelated (Pearson correlation coefficient r < 0.065 for all pairs).
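As an illustration of the envelope-correlation check described in the last sentence, the sketch below computes the Pearson correlation between the broadband envelopes of a paired male and female stimulus. This is illustrative Python/NumPy rather than the study's MATLAB code; the function and variable names are hypothetical, and envelope extraction is assumed to have been done as described later in the TRF section.

```python
# Sketch (not the authors' code): verify that the envelopes of each male/female
# stimulus pair are uncorrelated, as reported (Pearson r < 0.065 for all pairs).
import numpy as np
from scipy.stats import pearsonr

def pair_envelope_correlation(env_female: np.ndarray, env_male: np.ndarray) -> float:
    """Pearson correlation between two speech envelopes, truncated to a common length."""
    n = min(len(env_female), len(env_male))
    r, _ = pearsonr(env_female[:n], env_male[:n])
    return r

# Example with synthetic envelopes (10 s at a 100 Hz envelope rate):
rng = np.random.default_rng(0)
env_f, env_m = np.abs(rng.standard_normal((2, 1000)))
print(f"r = {pair_envelope_correlation(env_f, env_m):.3f}")
```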

In all conditions, the audio signal was presented diotically, so the auditory streams could not be segregated based on spatial cues. The videos were presented on either side of a computer screen, and the location of each speaker (left/right) was assigned randomly in each trial. In the A conditions, rectangular placeholders were presented on both sides of the screen instead of the videos.

Before each trial, instructions appeared in the center of the screen indicating which of the speakers to attend to (e.g., “Attend Female”). The participants indicated with a button press when they were ready to begin, and the stimuli started playing 2 s after their response. The location of the to-be-attended speaker was highlighted by a red frame and the verbal instruction remained in the center of the screen, to ensure the participants remembered which speaker to attend to during the entire trial. Each stimulus was cut off before the speaker uttered the last word, and then a target word appeared in the center of the screen. Participants' explicit task was to indicate via button press whether the target word was a congruent ending to the attended passage. For example: passage: “…my parents thought that idea was” Target words: silly/amusing/funny (congruent); purple/cloudy/hanging (incongruent). Target words were unique on each trial (no repetitions), and 50% were congruent with the attended segment. Progression to the next trial was self-paced.

There were a total of 40 trials in each condition, with each individual stimulus designated as the “attended” stimulus in 10 trials. The order of the trials was randomized throughout the experiment. Breaks were given approximately every 10 min, and the total duration of the experiment was ∼45 min.

Data analysis

All analyses were performed using MATLAB (MathWorks) and the FieldTrip toolbox (Oostenveld et al., 2011). The data were noise reduced off-line using time-shift principal component analysis (de Cheveigné and Simon, 2007). Ocular and cardiac artifacts were corrected using ICA decomposition (Jung et al., 2000). The data were visually inspected and trials with large artifacts were removed from further analysis. The data were initially segmented into trials starting 4 s before stimulus onset and lasting for 16 s poststimulus, to include the entire duration of all stimuli and to avoid contamination by edge effects in subsequent filtering and spectral analysis.
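A rough MNE-Python analogue of the artifact-correction and epoching steps is sketched below. This is an assumption-laden stand-in, not the authors' MATLAB/FieldTrip pipeline: the time-shift PCA denoising step is not reproduced, and the file name, event codes, and excluded ICA components are placeholders.

```python
import mne

raw = mne.io.read_raw_fif("subject01_raw.fif", preload=True)   # hypothetical file name

# ICA-based correction of ocular and cardiac artifacts (cf. Jung et al., 2000)
ica = mne.preprocessing.ICA(n_components=0.95, random_state=0)
ica.fit(raw)
ica.exclude = [0, 1]            # components flagged as EOG/ECG after visual inspection
raw_clean = ica.apply(raw.copy())

# Long epochs from -4 s to +16 s around stimulus onset, no baseline correction,
# leaving headroom for subsequent filtering and spectral analysis
events = mne.find_events(raw_clean)
epochs = mne.Epochs(raw_clean, events, event_id=None,
                    tmin=-4.0, tmax=16.0, baseline=None, preload=True)
```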

Although behavioral performance was ∼80% correct, we nonetheless included both correct and incorrect trials in our analysis. This decision was mainly due to the small number of trials per stimulus (n = 10), since the electrophysiological measures used here are highly sensitive to the number of trials, which needs to be equated across stimuli. Due to limitations on the total length of the experiment it was not feasible to substantially increase the number of trials. We recognize that this may weaken the size of our effects, since we cannot know whether incorrect responses reflect lapses of attention or whether attention was appropriately allocated but participants simply erred on the comprehension task. Nonetheless, we argue that any significant effects found despite this caveat are valid, because including incorrect trials works against our hypothesis.

We performed two complementary analyses to evaluate the envelope-tracking responses in the MEG signal.

Phase dissimilarity/selectivity analysis.

For the Single Speaker trials, we computed a “phase-dissimilarity index,” introduced by Luo and Poeppel (2007), which characterizes the consistency and uniqueness of the temporal (phase) pattern of neural responses to different speech tokens. The rationale behind this analysis is to compare the phase consistency across repetitions of the same stimulus (within-stimulus) with a baseline of phase consistency across trials in which different stimuli were presented (across-stimuli).

Since the stimuli differed somewhat in their duration, this analysis focused on the first 8 s of each epoch, matching the duration of the shortest stimulus. For each participant and each sensor, we first estimated the momentary phase in single trials. Since previous studies have shown phase dissimilarity effects primarily at frequencies <10 Hz, we performed a wavelet decomposition of single trials using a complex 6-cycle Morlet wavelet in logarithmic steps between 0.5 and 15 Hz, resulting in 51 frequency points. Next, we calculated the intertrial phase-locking value (ITC; Eq. 1) at each time-frequency point, across all trials in which the same stimulus was presented. For each frequency level, the ITC time course was averaged over time (0–8 s) and across all stimuli to obtain the average within-stimulus ITC:

$$\mathrm{ITC}(f,t) = \left| \frac{1}{N} \sum_{n=1}^{N} e^{i\phi_{n}(f,t)} \right|, \tag{1}$$

where $\phi_{n}(f,t)$ is the phase of trial $n$ at frequency $f$ and time $t$, and $N$ is the number of trials. The across-stimuli ITC was estimated using the same approach but on shuffled data, such that the ITC was computed across randomly selected trials in which different stimuli were presented. The phase-dissimilarity index is computed as the difference between the within-stimulus and the across-stimuli ITC (Eq. 2a), averaged over time and stimuli:

$$\mathrm{Dissimilarity}(f) = \overline{\mathrm{ITC}}_{\text{within}}(f) - \overline{\mathrm{ITC}}_{\text{across}}(f). \tag{2a}$$

Large phase-dissimilarity values indicate that the responses to individual stimuli have a highly consistent time course, as evidenced in the response to single trials. This analysis was performed separately for the A and AV trials.

For the Cocktail Party trials, we computed a "phase-selectivity index," which is based on the same logic as the phase-dissimilarity index but is designed to determine how attention modulates the temporal pattern of the neural response, given a particular pair of speakers. To this end, we calculated the within-attention and across-attention ITC (Eq. 2b), defined as follows. The within-attention ITC was computed across all trials in which the same pair of speakers was presented and the same speaker was attended. The across-attention ITC was computed across trials in which the same pair of speakers was presented, but with a random mixture of attend-female and attend-male trials. The within-attention and across-attention ITCs are then averaged over time and stimulus pairs and subtracted from each other, yielding the phase-selectivity index:

$$\mathrm{Selectivity}(f) = \overline{\mathrm{ITC}}_{\text{within-attention}}(f) - \overline{\mathrm{ITC}}_{\text{across-attention}}(f). \tag{2b}$$

Large phase-selectivity values indicate that the time course of the neural responses is influenced by attention and is substantially different when attending to different stimuli, despite the identical acoustic input. In contrast, low phase selectivity would suggest similar patterns of the neural responses despite attending to different speakers, a pattern that likely represents a mixture of responses to both speakers, which is not modulated by attention.
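For concreteness, a minimal Python/NumPy sketch of the ITC and phase-dissimilarity computation is given below (the study itself used MATLAB/FieldTrip). The array layout, the use of a single random shuffle for the across-stimuli baseline, and all names are assumptions of the sketch; the phase-selectivity index is obtained with the same code by substituting attention labels for stimulus labels.

```python
import numpy as np

def morlet(fs, f, n_cycles=6):
    """Complex Morlet wavelet at frequency f (Hz) with n_cycles cycles."""
    sigma_t = n_cycles / (2 * np.pi * f)
    t = np.arange(-3 * sigma_t, 3 * sigma_t, 1 / fs)
    return np.exp(2j * np.pi * f * t) * np.exp(-t ** 2 / (2 * sigma_t ** 2))

def wavelet_phase(trials, fs, freqs, n_cycles=6):
    """Single-trial phase, shape (n_trials, n_freqs, n_times), for one sensor.
    Assumes each trial is longer than the longest wavelet (true for these long epochs)."""
    phase = np.empty((trials.shape[0], len(freqs), trials.shape[1]))
    for fi, f in enumerate(freqs):
        w = morlet(fs, f, n_cycles)
        for ti, tr in enumerate(trials):
            phase[ti, fi] = np.angle(np.convolve(tr, w, mode="same"))
    return phase

def itc(phase):
    """Intertrial phase coherence across the trial axis (Eq. 1)."""
    return np.abs(np.mean(np.exp(1j * phase), axis=0))

def phase_dissimilarity(trials, labels, fs, freqs, seed=0):
    """Within-stimulus minus across-stimuli ITC (Eq. 2a), one value per frequency."""
    labels = np.asarray(labels)
    phase = wavelet_phase(trials, fs, freqs)
    within = np.mean([itc(phase[labels == s]) for s in np.unique(labels)], axis=0)
    shuffled = np.random.default_rng(seed).permutation(labels)   # break trial-stimulus pairing
    across = np.mean([itc(phase[shuffled == s]) for s in np.unique(labels)], axis=0)
    return (within - across).mean(axis=-1)                       # average over time

# 51 log-spaced frequencies between 0.5 and 15 Hz, as in the analysis above
freqs = np.logspace(np.log10(0.5), np.log10(15), 51)
```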

To select channels for statistical analysis, we collapsed the phase-dissimilarity and phase-selectivity indices across all conditions, frequencies, and participants, and selected the 20 channels with the highest averaged values. This procedure ensured that channel selection was not biased by condition or frequency band. We then averaged the phase-dissimilarity and phase-selectivity indices across those 20 channels, separately for each condition and frequency band. Since ITC values are not normally distributed, we applied a rau transformation to the phase-dissimilarity and phase-selectivity indices to make them suitable for linear parametric statistical testing (Studebaker, 1985). For each condition, we determined which frequencies had significant phase dissimilarity/selectivity using a t test at each frequency level. We controlled for multiple comparisons by requiring clusters of at least four consecutive frequency points at a significance level of p < 0.01. To evaluate AV effects, we performed a paired t test between the average phase dissimilarity/selectivity in the AV and A conditions, separately for the Single Speaker and Cocktail Party conditions.
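A sketch of this statistics step follows, in Python/SciPy. It uses the large-N approximation of Studebaker's (1985) rationalized arcsine transform, RAU = (146/π)·2·arcsin(√p) − 23, applied here to the within- and across-ITC values (which lie in [0, 1]) before they are differenced, and it implements the multiple-comparisons rule simply as a run of at least four consecutive significant frequency points at p < 0.01. These simplifications, and all names, are assumptions of the sketch rather than the authors' code.

```python
import numpy as np
from scipy.stats import ttest_1samp

def rau(p):
    """Approximate rationalized arcsine transform for values in [0, 1] (e.g., ITC)."""
    return (146.0 / np.pi) * 2.0 * np.arcsin(np.sqrt(p)) - 23.0

def dissimilarity_rau(itc_within, itc_across):
    """Rau-transform the two ITC spectra, then take their difference."""
    return rau(itc_within) - rau(itc_across)

def significant_frequency_clusters(index, alpha=0.01, min_len=4):
    """index: (n_participants, n_freqs) transformed dissimilarity/selectivity values.
    Marks frequencies lying in runs of >= min_len significant points (one-sample t test)."""
    pvals = ttest_1samp(index, 0.0, axis=0).pvalue
    sig = pvals < alpha
    mask = np.zeros_like(sig)
    start = None
    for i, s in enumerate(np.append(sig, False)):   # trailing False closes any open run
        if s and start is None:
            start = i
        elif not s and start is not None:
            if i - start >= min_len:
                mask[start:i] = True
            start = None
    return mask
```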

Temporal response function.

To determine the relationship between the neural response and the presented speech stimuli, we estimated a linear temporal response function (TRF) between the stimulus and the response. The neural response r(t) is modeled by the temporal envelope of the presented speaker s(t) as follows:

$$r(t) = \mathrm{TRF}(t) \ast s(t) + \varepsilon(t),$$

where TRF(t) is a linear kernel and ε(t) is the residual response not explained by the model (Ding and Simon, 2012). The broadband envelope of speech s(t) was extracted by filtering the speech stimuli between 250 and 4000 Hz and extracting the temporal envelope using a Hilbert transform. The temporal response functions TRF(t) were fitted using normalized reverse correlation as implemented in the STRFpak MATLAB toolbox (http://strfpak.berkeley.edu/) (Theunissen et al., 2001; Lalor and Foxe, 2010). Normalized reverse correlation involves inverting the autocorrelation matrix of the stimulus, which is usually numerically ill conditioned. Therefore, a pseudo-inverse is applied instead, which ignores eigenvalues of the autocorrelation matrix that are smaller than a predefined tolerance factor. The tolerance factor was scanned and determined by a pre-analysis to optimize the predictive power, and then fixed for all sensors and participants.
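The following Python sketch illustrates the envelope extraction and a simple regularized reverse-correlation fit of the kind described above. It is not the STRFpak routine used in the study: here the TRF is obtained from the pseudo-inverse of a lagged-envelope design matrix, with singular values below a tolerance fraction of the largest discarded, which plays the role of the tolerance factor. The filter order, the 100 Hz output rate, and all names are assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert, resample_poly

def broadband_envelope(audio, fs_audio, fs_out=100):
    """Band-pass 250-4000 Hz, take the Hilbert envelope, then downsample to fs_out."""
    sos = butter(4, [250, 4000], btype="bandpass", fs=fs_audio, output="sos")
    env = np.abs(hilbert(sosfiltfilt(sos, audio)))
    return resample_poly(env, fs_out, int(fs_audio))

def lagged_design(s, n_lags):
    """Design matrix of the envelope at lags 0..n_lags-1 samples (30 lags = 300 ms at 100 Hz)."""
    X = np.zeros((len(s), n_lags))
    for k in range(n_lags):
        X[k:, k] = s[:len(s) - k]
    return X

def estimate_trf(s, r, n_lags=30, tol=1e-3):
    """Least-squares TRF via the pseudo-inverse; singular values below tol * max are dropped."""
    X = lagged_design(s, n_lags)
    return np.linalg.pinv(X, rcond=tol) @ r   # TRF, shape (n_lags,)
```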

TRFs were estimated independently for each participant at each sensor, and separately for each of the four conditions. For the Single Speaker conditions, r(t) was a concatenated vector of the responses to each stimulus averaged over trials, and s(t) was a vector of the envelopes, concatenated in the same manner. In the Cocktail Party conditions, r(t) was a concatenated vector of the responses to each attended stimulus averaged over trials, and it was modeled by the temporal envelopes of both the attended and ignored speakers (sA(t) and sI(t), respectively), generating a temporal response function for each speaker (TRFA and TRFI, respectively):

$$r(t) = \mathrm{TRF}_{A}(t) \ast s_{A}(t) + \mathrm{TRF}_{I}(t) \ast s_{I}(t) + \varepsilon(t).$$

If the two films presented in the same trial had different lengths, only the portion of the stimulus that overlapped in time was included in the model, and the response r(t) to that stimulus pair was truncated accordingly. Both r(t) and s(t) were downsampled to 100 Hz before model estimation.
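Extending the sketch above to the Cocktail Party model, the two lagged-envelope matrices can be concatenated so that TRF_A and TRF_I are estimated jointly in a single fit. This reuses the lagged_design helper from the previous sketch and is again a hedged illustration, not the authors' implementation.

```python
import numpy as np

def estimate_two_speaker_trf(s_attended, s_ignored, r, n_lags=30, tol=1e-3):
    """Jointly fit TRF_A and TRF_I by regressing r on both lagged envelopes at once."""
    X = np.hstack([lagged_design(s_attended, n_lags), lagged_design(s_ignored, n_lags)])
    w = np.linalg.pinv(X, rcond=tol) @ r
    return w[:n_lags], w[n_lags:]   # (TRF_A, TRF_I)
```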

The TRFs were 300 ms long (30 estimated points) and were estimated using a jackknife cross-validation procedure to minimize effects of over-fitting (Ding and Simon, 2012). In this procedure, given a total of N stimuli, a TRF is estimated between s(t) and r(t) derived from N − 1 stimuli, and this estimate is used to predict the neural response to the left-out stimulus. The goodness of fit of the model was evaluated by the correlation between the actual neural response and the model prediction, called predictive power (David et al., 2007). The predictive power values obtained from the individual jackknife estimates were then averaged.
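A leave-one-out version of this procedure, reusing estimate_trf and lagged_design from the sketches above, might look as follows. It is illustrative Python rather than the study's MATLAB code, and it assumes equal-length envelope/response pairs.

```python
import numpy as np

def predictive_power(envelopes, responses, n_lags=30, tol=1e-3):
    """envelopes, responses: lists of 1-D arrays, one per stimulus, matched in length per pair.
    For each held-out stimulus, fit the TRF on the rest and correlate prediction with data."""
    scores = []
    for i in range(len(envelopes)):
        s_train = np.concatenate([s for j, s in enumerate(envelopes) if j != i])
        r_train = np.concatenate([r for j, r in enumerate(responses) if j != i])
        trf = estimate_trf(s_train, r_train, n_lags, tol)
        prediction = lagged_design(envelopes[i], n_lags) @ trf
        scores.append(np.corrcoef(prediction, responses[i])[0, 1])
    return float(np.mean(scores))
```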

To evaluate the significance of the TRF estimate, we repeated the cross-validation procedure for each participant and each sensor on surrogate data, mismatching the stimulus and response vectors. The statistical significance of the predictive power of the TRF estimated from the real data was evaluated by comparing it to the predictive power of the surrogate TRFs using a paired t test. Similarly, we evaluated the significance of the peak amplitude of the estimated TRF by comparing it to the amplitude of the surrogate TRFs at the same time point using a paired t test.
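One simple way to build such surrogates, sketched below under the same assumptions as the previous blocks, is to rotate the response list so that each envelope is paired with the wrong response, and then compare real and surrogate predictive power across participants with a paired t test (scipy.stats.ttest_rel).

```python
import numpy as np
from scipy.stats import ttest_rel

def surrogate_power(envelopes, responses, n_lags=30, tol=1e-3):
    """Predictive power with a deliberately mismatched stimulus-response pairing."""
    shifted = responses[1:] + responses[:1]                       # rotate the response list
    pairs = [(s[:min(len(s), len(r))], r[:min(len(s), len(r))])   # truncate mismatched pairs
             for s, r in zip(envelopes, shifted)]
    return predictive_power([s for s, _ in pairs], [r for _, r in pairs], n_lags, tol)

# With one real and one surrogate value per participant:
# t_stat, p_val = ttest_rel(real_power_per_participant, surrogate_power_per_participant)
```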

Source reconstruction

The analyses were repeated in source space for five of the participants for whom structural magnetic resonance imaging (MRI) was available. All source reconstructions were done using minimum norm estimation (Hämäläinen and Ilmoniemi, 1994).

Each participant's structural MRI was reconstructed using the FreeSurfer suite (http://surfer.nmr.mgh.harvard.edu/) to produce a 3D image of their MRI. This was used to localize neural activity onto the brain. The source space was set up such that each participant's brain surface contained ∼20480 “triangles” of localized dipoles. Due to computational constraints, this value was downsampled using a triangulation procedure that recursively subdivides the inflated spherical surface into icosahedrons and then subdivides the number of triangles (sides) of these icosahedrons by a factor of four for the TRF analysis, and by a factor of 16 for the phase-dissimilarity calculations. This produced, for the TRF analysis, a source space with 2562 sources per hemisphere, with an average source spacing of ∼6.2 mm. For the ITC calculations, this produced 642 sources per hemisphere, with an average source spacing of ∼10 mm.
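For reference, the same decimation levels can be requested in MNE-Python as 'ico4' (2562 sources per hemisphere) and 'ico3' (642 sources per hemisphere). The snippet below is an assumed analogue of this setup rather than the authors' pipeline; the subject name and FreeSurfer directory are placeholders.

```python
import mne

subjects_dir = "/path/to/freesurfer/subjects"   # hypothetical FreeSurfer SUBJECTS_DIR

# ~2562 sources/hemisphere (~6 mm spacing) for the TRF analysis,
# ~642 sources/hemisphere (~10 mm spacing) for the ITC calculations
src_trf = mne.setup_source_space("subject01", spacing="ico4", subjects_dir=subjects_dir)
src_itc = mne.setup_source_space("subject01", spacing="ico3", subjects_dir=subjects_dir)
```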

The forward solution was computed using the decimated reconstruction surface together with a single-compartment (homogeneous) boundary element model, for MEG data only.

The inverse operator was then computed using the forward solution and a noise covariance matrix computed, for each participant, from no-task data collected at the beginning of the experimental session. A depth weighting of 0.8 and a regularization parameter, λ2, of 0.1 were used.
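The inverse step can be expressed with MNE-Python roughly as below. The study cites the minimum-norm method of Hämäläinen and Ilmoniemi (1994), not this particular toolbox, so this is an analogue rather than the original pipeline; `raw_no_task`, the `-trans.fif` file, and the reuse of `epochs` and `src_trf` from the earlier sketches are placeholders.

```python
import mne
from mne.minimum_norm import make_inverse_operator, apply_inverse

# Single-compartment (homogeneous) BEM and an MEG-only forward model
bem = mne.make_bem_solution(
    mne.make_bem_model("subject01", ico=4, conductivity=(0.3,), subjects_dir=subjects_dir))
fwd = mne.make_forward_solution(epochs.info, trans="subject01-trans.fif",
                                src=src_trf, bem=bem, meg=True, eeg=False)

# Noise covariance from data recorded before the task began (placeholder recording)
noise_cov = mne.compute_raw_covariance(raw_no_task)

# Fixed-orientation minimum-norm inverse with depth weighting 0.8 and lambda^2 = 0.1
inv = make_inverse_operator(epochs.info, fwd, noise_cov, depth=0.8, fixed=True)
stc = apply_inverse(epochs.average(), inv, lambda2=0.1, method="MNE")
```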

The orientation of the sources was fixed to be normal to the cortical surface, as the primary source of the MEG signal is thought to originate from postsynaptic potentials of the apical dendrites of large pyramidal cells oriented perpendicular to the cortical surface (Hämäläinen and Ilmoniemi, 1994). Input data for the ITC calculations were the raw time series of the neural responses to each of the stimuli for each participant. Single-trial time-frequency analysis was performed on the source-reconstructed data with the same analysis as in the sensor space (wavelet decomposition using a complex 6-cycle Morlet wavelet, logarithmically stepped between 0.5 and 15 Hz). For the TRF analyses, the TRF estimated at the sensor level was used as input into the source reconstruction and no further time-frequency analysis was performed.

Individual participant brains were averaged onto a common brain using the FreeSurfer suite, with a transformation based on Montreal Neurological Institute coordinates. Data were then smoothed across participants (on this average brain) using a Gaussian smoothing function.

Results

Behavioral results

Task performance was generally good, indicating that the participants were attending appropriately according to instructions. In the Single Speaker conditions, performance was equally good in the A and AV trials (hit rates between 78 and 80%; t(12) = 1.12, p > 0.2); however, in the Cocktail Party conditions performance was significantly reduced in the A trials compared with the AV trials (hit rates of 74 vs 81%, respectively; t(12) = 2.44, p < 0.05; Fig. 1B).

Figure 1.

Paradigm and behavioral results. A, Illustration of the four experimental conditions. In each trial, a stimulus pair was presented with or without the corresponding video of the speakers (AV/A) and either one speaker or both speakers were audible (Single Speaker/Cocktail Party). The red rectangle indicated which speaker was designated to be attended. Trial order was randomized throughout the experiment. B, Ratio of correct responses, averaged across participants (±1 SEM). Performance was generally good in all conditions, indicating that participants were indeed attending to the prescribed speaker. Performance was reduced slightly, but significantly, in the A Cocktail Party condition suggesting that selective attention was more challenging in the absence of visual input.

MEG results

We used two complementary approaches to quantify neural envelope tracking and its selectivity in the Cocktail Party condition. The first, phase dissimilarity/selectivity, evaluates the consistency of the neural responses across repetitions of the same stimulus, whereas the second, temporal response function (TRF) estimation, computes the direct relationship between the stimulus and the neural response it elicits.

Phase dissimilarity and selectivity

Figure 2A illustrates qualitatively that listening to different speech tokens in the Single Speaker condition elicits markedly different temporal patterns of the neural response, as previously demonstrated (Luo and Poeppel, 2007). This effect was quantified by calculating the phase-dissimilarity index, separately for A and AV Single Speaker trials. Significant phase dissimilarity was found for frequencies between 2 and 10 Hz (p < 0.01; Fig. 3A) in both conditions, indicating that the phase of neural activity in this frequency range faithfully tracked the presented speech token. The spatial distribution of the phase-dissimilarity index in this frequency range was typical of auditory responses. The difference between the phase dissimilarity in the A and AV conditions was not significant (t < 1.0), indicating that adding visual input did not significantly improve the representation of the stimulus in auditory cortex.

Figure 2.

Time course of MEG responses. Example time courses of the neural response to the different speech tokens, averaged over trials and participants and filtered between 1 and 10 Hz. Data are shown averaged over the five sensors with the strongest auditory response (with negative polarity), indicated by green dots in the topographical map on the right. A, Neural response to two different tokens presented in the Auditory Single Speaker condition. Replicating previous findings (Luo and Poeppel, 2007), different speech tokens elicit strikingly different response time courses. B, Neural responses when attending either to the female or the male speaker in the Auditory Cocktail condition. The time courses in the two attentional conditions overlap substantially, with no apparent attentional modulation. C, Neural responses when attending either to the female or the male speaker in the Audio-Visual Cocktail condition. In this case, the time courses in the two attentional conditions do not overlap; rather, the temporal pattern of the response is highly influenced by attention, yielding markedly different patterns when attending to different speakers, despite the identical acoustic input.

Figure 3.

Phase dissimilarity and selectivity. A, Phase-dissimilarity values at frequencies between 1 and 16 Hz for the Single Speaker trials in the AV (left) and A (right) conditions, averaged across participants at the top 20 channels (gray shading indicates ±1 SEM over participants). The red line indicates frequencies where phase dissimilarity was significant in each condition (p < 0.01). B, Phase-selectivity values between 1 and 16 Hz for the Cocktail Party trials in the AV (left) and A (right) conditions, averaged across participants at the top 20 channels (gray shading indicates ±1 SEM over participants). The red line indicates frequencies where phase selectivity was significant in each condition (p < 0.01). C, Average values of phase dissimilarity/selectivity between 4 and 8 Hz in all conditions. D, Topographical distribution of phase dissimilarity/selectivity values, averaged across all conditions and participants. This distribution is typical of auditory responses.

In the Cocktail Party conditions we calculated a phase-selectivity index, which follows the same logic as the phase-dissimilarity index but characterizes how different the temporal pattern of the neural responses is when attention is directed toward different speakers (attend female vs attend male), even though the acoustic input remains the same (combination of the two voices). In this case, high phase selectivity indicates selective representation (tracking) of the attended stimulus, whereas low phase selectivity indicates similar patterns of the neural responses, despite attending to different speakers; this pattern likely represents a mixture of responses to both speakers with no detectable modulation by attention. Examples of how allocating attention to different speakers influenced the time course of the neural response in the A and AV cocktail conditions are shown in Figure 2B,C.

Significant phase selectivity was found in the AV cocktail condition between 3 and 8 Hz (Fig. 3B) and shared a similar auditory spatial distribution with the phase dissimilarity in the Single Speaker condition. Crucially, the phase selectivity in the A cocktail condition did not reach the threshold for significance at any frequency, and was significantly lower than in the AV cocktail condition (t(12) = 3.07, p < 0.01).

The implication of these results is that viewing the face of an attended speaker in a Cocktail Party situation enhances the capacity of auditory cortex to selectively represent and track that speaker, just as if that speaker were presented alone. However, this capacity is sharply reduced when relying on auditory information alone, with the neural response showing no detectable selectivity.

We next investigated whether the observed envelope tracking in the A and AV conditions indeed reflects activity in auditory cortex, or whether when viewing movies a similar tracking response is found in visual cortex (or both). To this end, we repeated the analysis in source space for five of the participants for whom structural MRIs were available. Results show that the phase-dissimilarity and phase-selectivity effects are entirely attributed to auditory cortex, in both the A and AV conditions (Fig. 4). The source localization results did not adhere to strict anatomical boundaries (e.g., the Sylvian Fissure), but rather formed a region of activity centered in auditory regions but extending beyond these areas. This pattern is likely due to limitations of the source reconstruction given the low number of participants for this analysis (five), as well as field spread of low-frequency signals (see discussion by Schoffelen and Gross, 2009; Peelle et al., 2012).

Figure 4.

Source reconstruction of the phase dissimilarity/selectivity. Reconstruction of ITC between 4 and 8 Hz was performed in each condition for a subset of five participants. Hot and cold colors, respectively, represent strong and weak ITC. Results indicate that phase tracking originates in auditory cortex in all conditions, but is substantially reduced in the A Cocktail condition. Notably, no evidence for phase tracking is found in visual cortex in either of the AV conditions.

TRF estimation

The phase-dissimilarity index gives a robust estimate of how well the neural response represents a particular stimulus, but it remains an indirect measure of envelope tracking. A more direct approach is to estimate a TRF, which models the relationship between speech stimuli and the neural responses they elicit, as well as the temporal lag between them (Theunissen et al., 2001; Lalor and Foxe, 2010; Ding and Simon, 2012). We estimated the TRFs between the speech envelope and the neural response in each of the four conditions. In the Cocktail Party conditions, the TRF model was estimated using the envelopes of both the attended and ignored speakers, which allowed us to compare the responses to each of the concurrently presented stimuli and estimate the relative contribution of each stimulus to the measured response (see Materials and Methods). The predictive power of the model (averaged across all conditions) was significantly higher than chance (t(12) = 3.56, p < 0.005 vs surrogate TRFs), confirming the robustness of the estimation.

Results of the TRF analysis are shown in Figure 5. The TRF, averaged across all conditions, had a significant peak at ∼50 ms (p < 0.01 vs surrogate TRFs), which displayed the spatial distribution of an auditory response, indicating that the neural response in auditory cortex tracked the speech envelope with a lag of ∼50 ms. The absolute value of this peak was taken to reflect the strength of the envelope tracking, which we compared across the different conditions. In the Single Speaker condition, TRF amplitude was significantly higher for AV versus A trials (t(12) = 2.7, p < 0.05). Since the phase dissimilarity was significant for both AV and A trials, indicating that different tokens elicit unique temporal patterns of neural response in both conditions, this additional TRF effect could reflect either visual enhancement of the amplitude of the neural response or improved temporal tracking of the speaker in the AV condition.

Figure 5.

TRF analysis. A, Estimated TRF waveforms across all conditions, averaged over the top 10 sensors with positive TRF peak polarity. Top, TRFs to AV and A speech in the Single Speaker condition. The TRFs share a similar time course, with a prominent peak at 50 ms, which is larger in the AV versus A condition. Bottom, TRFs to attended and ignored speakers in the AV (left) and A (right) conditions. In the AV condition the response is strikingly selective for the attended speaker whereas in the A condition similar responses are found for attended and ignored speakers. B, Bar graphs depicting the average TRF peak amplitude at 50 ms across all conditions (absolute value), averaged across the top 20 sensors. C, Topographical distribution of TRF peak amplitude averaged over all participants and conditions (left) and source reconstruction of the TRF peak from five participants (right), indicating it originates in auditory cortex.

In the Cocktail Party condition, we performed a two-way ANOVA with factors Modality (A/AV) and Speaker (attended/ignored). There was a main effect of Modality (F(1,12) = 11.9, p < 0.005), which shows that tracking was overall better for AV versus A stimuli. There was also a main effect of Speaker (F(1,12) = 5.6, p < 0.05), and the interaction between the factors trended toward significance (F(1,12) = 3.4, p = 0.08). Post hoc analyses indicated that for the AV stimuli, TRF amplitude was significantly higher for the attended versus the ignored speaker (t(13) = 2.6, p < 0.05), but this difference was not significant in the A trials (t < 1.0). Source reconstruction of the TRF signal in a subset of five participants confirmed that the envelope-tracking response originated in auditory cortex (Fig. 5C). No evidence for an envelope-tracking response was found in visual regions.

Discussion

The findings demonstrate that viewing a speaker's face enhances the capacity of auditory cortex to track the temporal speech envelope of that speaker. Visual facilitation is most effective in a Cocktail Party setting, and promotes preferential or selective tracking of the attended speaker, whereas without visual input no significant preference for the attended speaker is achieved. This pattern is in line with behavioral studies showing that the contribution of congruent visual input to speech processing is most substantial under noisy auditory conditions (O'Neill, 1954; Sumby and Pollack, 1954; Helfer and Freyman, 2005; Ross et al., 2007; Bishop and Miller, 2009). These results underscore the importance of visual input in resolving perceptual ambiguity and in directing attention toward a behaviorally relevant speaker in a noisy environment.

Behavioral studies also show robust interference effects from ignored stimuli (Cherry, 1953; Moray, 1959; Wood and Cowan, 1995; Beaman et al., 2007), which are, arguably, due to the fact that ignored stimuli are also represented in auditory cortex (albeit often with reduced amplitude; Woldorff et al., 1993; Ding and Simon, 2012). Thus, improving the selectivity of auditory tracking is likely to have causal implications for performance, and indeed here we show that in the Cocktail Party conditions performance improved in the AV compared with the A condition, alongside an increase in selective tracking.

Visual cues facilitate speech-envelope tracking

Multiple studies have shown that congruent visual input enhances the neural responses to speech in auditory cortex and in higher order speech-related areas (Callan et al., 2003; Sekiyama et al., 2003; Besle et al., 2004; Bishop and Miller, 2009; McGettigan et al., 2012); however, exactly how the visual input influences activity in auditory cortex is not well understood. In this study we find that when a Single Speaker is presented, reliable low-frequency phase tracking is achieved in auditory cortex both with and without visual input (as indicated by equivalent degrees of phase dissimilarity), yet the amplitude of this tracking response is enhanced in the AV condition (shown by a larger TRF peak amplitude), implying visual amplification of the auditory response. In the Cocktail Party case the role of visual input becomes more crucial: in its absence, similar responses are obtained for attended and ignored speakers, yielding no significant selectivity, whereas viewing the talking face allows auditory cortex to preferentially track the attended speaker at the expense of the ignored one.

Which aspects of the visual input facilitate envelope tracking in auditory cortex? Two types of facial gestures in speech, which operate on different timescales, have been shown to correlate with speech acoustics and improve speech processing. The first are articulation movements of the mouth and jaw, which are correlated with the temporal envelope of speech, both of which are temporally modulated at rates between 2 and 7 Hz (Grant and Seitz, 2000; Chandrasekaran et al., 2009), commensurate with the syllabic rate of speech and the frequency range where we and others have found phase-tracking effects. Several studies have demonstrated increased speech recognition and intelligibility when acoustic input was accompanied by visualization of articulatory gestures (Grant and Seitz, 2000; Grant, 2001; Kim and Davis, 2003). Beyond the contribution of articulatory gestures, other aspects of body movements, such as head and eyebrow movements, have also been shown to improve speech recognition (Munhall et al., 2004; Scarborough et al., 2009). Head movements and other body motions are linked to the production of suprasegmental features of speech such as stress, rhythmicity, and other aspects of prosody (Birdwhistell, 1970; Bennett, 1980; Hadar et al., 1983).

Common to both types of facial gestures is the fact that they precede the acoustic signal. Facial articulation movements precede speech by 100–300 ms (Grant and Seitz, 2000; Chandrasekaran et al., 2009), and the onset of head movements generally precedes the onset of stressed syllables by at least 100 ms (Hadar et al., 1983). This has led to the suggestion that visual input assists speech perception by predicting the timing of the upcoming auditory input. Estimating when auditory input will occur serves to enhance sensitivity and promote optimal processing in auditory cortex. Supporting this proposal, Schwartz et al. (2004) demonstrated improved intelligibility when auditory syllables were presented together with lip movements that predicted the timing of auditory input, even if the visual cues themselves carried no information about the identity of the syllable (Kim and Davis, 2003). Further supporting the predictive role carried by visual input, neurophysiological data by Arnal et al. (2009, 2011) demonstrate that the early neural response to AV syllables (<200 ms) is enhanced in a manner proportional to the predictive value carried by the visual input.

The predictive role assigned to visual cues is in line with the "Attention in Time" hypothesis (Large and Jones, 1999; Jones et al., 2006; Nobre et al., 2007; Nobre and Coull, 2010), which posits that attention can be directed to particular points in time when relevant stimuli are expected, similar to the allocation of spatial attention. Speech is naturally rhythmic, and predictions about the timing of upcoming events (e.g., syllables) can be formed from temporal regularities within the acoustics alone (Elhilali et al., 2009; Shamma et al., 2011), yet visual cues that precede the audio can serve to reinforce and tune those predictions. Moreover, if the visual input also carries predictive information about the speech content (say, a bilabial vs velar articulation), the listener can derive further processing benefit.

A mechanistic perspective on speech-envelope tracking

From a mechanistic perspective, we can offer two hypotheses as to how the Attention in Time hypothesis could be implemented on the neural level. The first possibility is that auditory cortex contains spectrotemporal representations of both speakers; however, the portion of the auditory response that is temporally coherent with the visual input is selectively amplified. This perspective of binding through temporal coherence is in line with computational perspectives on stream segregation (Elhilali et al., 2009). Alternatively, it has been shown that predictive nonauditory input can reset the phase of low-frequency oscillations in auditory cortex (Lakatos et al., 2007, 2009; Kayser et al., 2008), a mechanism that could be particularly advantageous for improving selective envelope tracking under adverse auditory conditions, such as the Cocktail Party situation. Since low-frequency oscillations govern the timing of neuronal excitability (Buzsáki and Chrobak, 1995; Lakatos et al., 2005; Mizuseki et al., 2009), visually guided phase resets in auditory cortex would align the timing of high neuronal excitability with the timing of attended events, and consequently, events from the to-be-ignored streams would naturally fall on random phases of excitability, contributing to their perceptual attenuation (Schroeder et al., 2008; Zion Golumbic et al., 2012). Both perspectives emphasize the significance of temporal structure in forming a neural representation for speech and selecting the appropriate portion of the auditory scene; however, additional research is needed to fully understand the mechanistic interaction between the visual and auditory speech content.

Another mechanistic question raised by these data is whether visual facilitation of the envelope-tracking response is mediated through multisensory regions, such as the superior temporal and intraparietal sulci, and then fed back to auditory cortex (Beauchamp et al., 2004), or whether it is brought about through feedforward projections from extralemniscal thalamic regions, or direct lateral connections between visual and auditory cortex (Schroeder et al., 2008; Musacchia and Schroeder, 2009). Adjudicating between these possibilities requires additional research; however, the fact that AV influences were observed here as early as 50 ms hints that they are mediated through either thalamocortical or lateral connections between the sensory cortices.

Relationship to previous studies

There is evidence that while watching movies, visual cortex also displays phase locking to the stimulus (Luo et al., 2010). However, in the current study both the phase-dissimilarity/selectivity effects and TRF estimation were localized to auditory cortex, and we found no evidence for phase locking and/or envelope tracking in visual regions. This does not preclude the possibility that there is phase locking to the videos in visual cortex, which might be too weak to pick up using the current experimental design (due to the relatively dull visual input of a face with little motion) or might not be locked to the acoustic envelope.

Previous studies have also shown preferential tracking of attended versus ignored speech even without visual input (Kerlin et al., 2010; Ding and Simon, 2012), similar to classic attentional modulation of evoked responses (Hubel et al., 1959; Hillyard et al., 1973; Tiitinen et al., 1993). These findings contrast with the current results where we failed to find significant attentional modulations in the auditory-only condition. However, critically, in those studies the two speakers were presented from different spatial locations, and not from a central location as in the current study. It is well established that spatial information contributes to stream segregation and attentional selection (Moore and Gockel, 2002; Fritz et al., 2007; Hafter et al., 2007; Elhilali and Shamma, 2008; Elhilali et al., 2009; McDermott, 2009; Shamma et al., 2011). Thus, the attentional effects reported in those studies could have been influenced by spatial cues. Future experiments are needed to assess the relative contribution of visual, spatial, and spectral cues to selective envelope-tracking.

Footnotes

  • This work was supported by funding from the following grants: National Institutes of Health 2R01DC 05660 to D.P., P50 MH086385 to C.E.S., and F32 MH093601 to E.Z.G.

  • Correspondence should be addressed to Elana Zion Golumbic, Division for Therapeutic Brain Stimulation, Department of Psychiatry, Columbia University College of Physicians and Surgeons, New York, NY 10032. ezg2101@columbia.edu

References

  1. Aiken SJ, Picton TW (2008) Human cortical responses to the speech envelope. Ear Hear 29:139–157.
  2. Arnal LH, Morillon B, Kell CA, Giraud AL (2009) Dual neural routing of visual facilitation in speech processing. J Neurosci 29:13445–13453.
  3. Arnal LH, Wyart V, Giraud AL (2011) Transitions in neural oscillations reflect prediction errors generated in audiovisual speech. Nat Neurosci 14:797–801.
  4. Beaman CP, Bridges AM, Scott SK (2007) From dichotic listening to the irrelevant sound effect: a behavioural and neuroimaging analysis of the processing of unattended speech. Cortex 43:124–134.
  5. Beauchamp MS, Lee KE, Argall BD, Martin A (2004) Integration of auditory and visual information about objects in superior temporal sulcus. Neuron 41:809–823.
  6. Bennett AT (1980) Rhythmic analysis of multiple levels of communicative behavior in face to face interaction. In: Aspects of nonverbal communication (Raffler-Engel W, ed), pp 237–251. Lisse, The Netherlands: Swets and Zeitlinger.
  7. Besle J, Fort A, Delpuech C, Giard MH (2004) Bimodal speech: early suppressive visual effects in human auditory cortex. Eur J Neurosci 20:2225–2234.
  8. Birdwhistell RL (1970) Kinesics and context: essays on body motion communication. Philadelphia: University of Pennsylvania.
  9. Bishop CW, Miller LM (2009) A multisensory cortical network for understanding speech in noise. J Cogn Neurosci 21:1790–1805.
  10. Buzsáki G, Chrobak JJ (1995) Temporal structure in spatially organized neuronal ensembles: a role for interneuronal networks. Curr Opin Neurobiol 5:504–510.
  11. Callan D, Jones JA, Munhall K, Callan AM, Kroos C, Vatikiotis-Bateson E (2003) Neural processes underlying perceptual enhancement by visual speech gestures. Neuroreport 14:2213–2218.
  12. Chandrasekaran C, Trubanova A, Stillittano S, Caplier A, Ghazanfar AA (2009) The natural statistics of audiovisual speech. PLoS Comput Biol 5:e1000436.
  13. Cherry EC (1953) Some experiments on the recognition of speech, with one and two ears. J Acoust Soc Am 25:975–979.
  14. David SV, Mesgarani N, Shamma SA (2007) Estimating sparse spectro-temporal receptive fields with natural stimuli. Network 18:191–212.
  15. Davis C, Kislyuk D, Kim J, Sams M (2008) The effect of viewing speech on auditory speech processing is different in the left and right hemispheres. Brain Res 1242:151–161.
  16. de Cheveigné A, Simon JZ (2007) Denoising based on time-shift PCA. J Neurosci Methods 165:297–305.
  17. Ding N, Simon JZ (2012) Neural coding of continuous speech in auditory cortex during monaural and dichotic listening. J Neurophysiol 107:78–89.
  18. Drullman R (2006) The significance of temporal modulation frequencies for speech intelligibility. In: Listening to speech: an auditory perspective (Greenberg S, Ainsworth W, eds), pp 39–48. Mahwah, NJ: Lawrence Erlbaum.
  19. Elhilali M, Shamma SA (2008) A cocktail party with a cortical twist: how cortical mechanisms contribute to sound segregation. J Acoust Soc Am 124:3751–3771.
  20. Elhilali M, Ma L, Micheyl C, Oxenham AJ, Shamma SA (2009) Temporal coherence in the perceptual organization and cortical representation of auditory scenes. Neuron 61:317–329.
  21. Fritz JB, Elhilali M, David SV, Shamma SA (2007) Auditory attention–focusing the searchlight on sound. Curr Opin Neurobiol 17:437–455.
  22. Ghitza O (2011) Linking speech perception and neurophysiology: speech decoding guided by cascaded oscillators locked to the input rhythm. Front Psychol 2:130.
  23. Giraud AL, Poeppel D (2012) Cortical oscillations and speech processing: emerging computational principles and operations. Nat Neurosci 15:511–517.
  24. Grant KW (2001) The effect of speechreading on masked detection thresholds for filtered speech. J Acoust Soc Am 109:2272–2275.
  25. Grant KW, Seitz PF (2000) The use of visible speech cues for improving auditory detection of spoken sentences. J Acoust Soc Am 108:1197–1208.
  26. Hadar U, Steiner TJ, Grant EC, Rose FC (1983) Head movement correlates of juncture and stress at sentence level. Lang Speech 26:117–129.
  27. Hafter E, Sarampalis A, Loui P (2007) Auditory attention and filters. In: Auditory perception of sound sources (Yost W, Popper AN, Fay RR, eds), pp 115–142. New York: Springer.
  28. Hämäläinen MS, Ilmoniemi RJ (1994) Interpreting magnetic fields of the brain: minimum norm estimates. Med Biol Eng Comput 32:35–42.
  29. Helfer KS, Freyman RL (2005) The role of visual speech cues in reducing energetic and informational masking. J Acoust Soc Am 117:842–849.
  30. Hertrich I, Dietrich S, Trouvain J, Moos A, Ackermann H (2012) Magnetic brain activity phase-locked to the envelope, the syllable onsets, and the fundamental frequency of a perceived speech signal. Psychophysiology 49:322–334.
  31. Hillyard SA, Hink RF, Schwent VL, Picton TW (1973) Electrical signs of selective attention in the human brain. Science 182:177–180.
  32. Hubel DH, Henson CO, Rupert A, Galambos R (1959) "Attention" units in the auditory cortex. Science 129:1279–1280.
  33. Jones MR, Johnston HM, Puente J (2006) Effects of auditory pattern structure on anticipatory and reactive attending. Cogn Psychol 53:59–96.
  34. Jung TP, Makeig S, Westerfield M, Townsend J, Courchesne E, Sejnowski TJ (2000) Removal of eye activity artifacts from visual event-related potentials in normal and clinical subjects. Clin Neurophysiol 111:1745–1758.
  35. Kayser C, Petkov CI, Logothetis NK (2008) Visual modulation of neurons in auditory cortex. Cereb Cortex 18:1560–1574.
  36. Kerlin JR, Shahin AJ, Miller LM (2010) Attentional gain control of ongoing cortical speech representations in a "cocktail party." J Neurosci 30:620–628.
  37. Kim J, Davis C (2003) Hearing foreign voices: does knowing what is said affect visual-masked-speech detection? Perception 32:111–120.
  38. Lakatos P, Shah AS, Knuth KH, Ulbert I, Karmos G, Schroeder CE (2005) An oscillatory hierarchy controlling neuronal excitability and stimulus processing in the auditory cortex. J Neurophysiol 94:1904–1911.
  39. Lakatos P, Chen CM, O'Connell MN, Mills A, Schroeder CE (2007) Neuronal oscillations and multisensory interaction in primary auditory cortex. Neuron 53:279–292.
  40. Lakatos P, O'Connell MN, Barczak A, Mills A, Javitt DC, Schroeder CE (2009) The leading sense: supramodal control of neurophysiological context by attention. Neuron 64:419–430.
  41. Lalor EC, Foxe JJ (2010) Neural responses to uninterrupted natural speech can be extracted with precise temporal resolution. Eur J Neurosci 31:189–193.
  42. Large EW, Jones MR (1999) The dynamics of attending: how people track time-varying events. Psychol Rev 106:119–159.
  43. Luo H, Poeppel D (2007) Phase patterns of neuronal responses reliably discriminate speech in human auditory cortex. Neuron 54:1001–1010.
  44. Luo H, Liu Z, Poeppel D (2010) Auditory cortex tracks both auditory and visual stimulus dynamics using low-frequency neuronal phase modulation. PLoS Biol 8:e1000445.
  45. McDermott JH (2009) The cocktail party problem. Curr Biol 19:R1024–R1027.
  46. McGettigan C, Faulkner A, Altarelli I, Obleser J, Baverstock H, Scott SK (2012) Speech comprehension aided by multiple modalities: behavioural and neural interactions. Neuropsychologia 50:762–776.
  47. Mesgarani N, Chang EF (2012) Selective cortical representation of attended speaker in multi-talker speech perception. Nature 485:233–236.
  48. Mizuseki K, Sirota A, Pastalkova E, Buzsáki G (2009) Theta oscillations provide temporal windows for local circuit computation in the entorhinal-hippocampal loop. Neuron 64:267–280.
  49. Moore BCJ, Gockel H (2002) Factors influencing sequential stream segregation. Acta Acustica 88:320–332.
  50. Moray N (1959) Attention in dichotic listening: affective cues and the influence of instructions. Q J Exp Psychol 11:56–60.
  51. Munhall KG, Jones JA, Callan DE, Kuratate T, Vatikiotis-Bateson E (2004) Visual prosody and speech intelligibility. Psychol Sci 15:133–137.
  52. Musacchia G, Schroeder CE (2009) Neuronal mechanisms, response dynamics and perceptual functions of multisensory interactions in auditory cortex. Hear Res 258:72–79.
  53. Nobre AC, Coull JT (2010) Attention and time. Oxford, UK: Oxford UP.
  54. Nobre A, Correa A, Coull J (2007) The hazards of time. Curr Opin Neurobiol 17:465–470.
  55. Oldfield RC (1971) The assessment and analysis of handedness: the Edinburgh inventory. Neuropsychologia 9:97–113.
  56. O'Neill JJ (1954) Contributions of the visual components of oral symbols to speech comprehension. J Speech Hear Disord 19:429–439.
  57. Oostenveld R, Fries P, Maris E, Schoffelen J-M (2011) FieldTrip: open source software for advanced analysis of MEG, EEG, and invasive electrophysiological data. Comput Intell Neurosci 2011.
  58. Peelle JE, Gross J, Davis MH (2012) Phase-locked responses to speech in human auditory cortex are enhanced during comprehension. Cereb Cortex. doi:10.1093/cercor/bhs118. Advance online publication. Retrieved May 17, 2012.
  59. Rosen S (1992) Temporal information in speech: acoustic, auditory and linguistic aspects. Philos Trans R Soc Lond B Biol Sci 336:367–373.
  60. Ross LA, Saint-Amour D, Leavitt VM, Javitt DC, Foxe JJ (2007) Do you see what I am saying? Exploring visual enhancement of speech comprehension in noisy environments. Cereb Cortex 17:1147–1153.
  61. Scarborough R, Keating P, Mattys SL, Cho T, Alwan A (2009) Optical phonetics and visual perception of lexical and phrasal stress in English. Lang Speech 52:135–175.
  62. Schoffelen JM, Gross J (2009) Source connectivity analysis with MEG and EEG. Hum Brain Mapp 30:1857–1865.
  63. ↵
    1. Schroeder CE,
    2. Lakatos P,
    3. Kajikawa Y,
    4. Partan S,
    5. Puce A
    (2008) Neuronal oscillations and visual amplification of speech. Trends Cogn Sci 12:106–113.
    OpenUrlCrossRefPubMed
  64. ↵
    1. Schwartz JL,
    2. Berthommier F,
    3. Savariaux C
    (2004) Seeing to hear better: evidence for early audio-visual interactions in speech identification. Cognition 93:B69–B78.
    OpenUrlCrossRefPubMed
  65. ↵
    1. Sekiyama K,
    2. Kanno I,
    3. Miura S,
    4. Sugita Y
    (2003) Auditory-visual speech perception examined by fMRI and PET. Neurosci Res 47:277–287.
    OpenUrlCrossRefPubMed
  66. ↵
    1. Shamma SA,
    2. Elhilali M,
    3. Micheyl C
    (2011) Temporal coherence and attention in auditory scene analysis. Trends Neurosci 34:114–123.
    OpenUrlCrossRefPubMed
  67. ↵
    1. Shannon RV,
    2. Zeng FG,
    3. Kamath V,
    4. Wygonski J,
    5. Ekelid M
    (1995) Speech recognition with primarily temporal cues. Science 270:303–304.
    OpenUrlAbstract/FREE Full Text
  68. ↵
    1. Studebaker GA
    (1985) A “rationalized” arcsine transform. J Speech Hear Res 28:455–462.
    OpenUrlPubMed
  69. ↵
    1. Sumby WH,
    2. Pollack I
    (1954) Visual contribution to speech intelligibility in noise. J Acoust Soc Am 26:212–215.
    OpenUrlCrossRef
  70. ↵
    1. Theunissen FE,
    2. David SV,
    3. Singh NC,
    4. Hsu A,
    5. Vinje WE,
    6. Gallant JL
    (2001) Estimating spatio-temporal receptive fields of auditory and visual neurons from their responses to natural stimuli. Network 12:289–316.
    OpenUrlCrossRefPubMed
  71. ↵
    1. Tiitinen H,
    2. Sinkkonen J,
    3. Reinikainen K,
    4. Alho K,
    5. Lavikainen J,
    6. Näätänen R
    (1993) Selective attention enhances the auditory 40-hz transient response in humans. Nature 364:59–60.
    OpenUrlCrossRefPubMed
  72. ↵
    1. van Wassenhove V,
    2. Grant KW,
    3. Poeppel D
    (2005) Visual speech speeds up the neural processing of auditory speech. Proc Natl Acad Sci U S A 102:1181–1186.
    OpenUrlAbstract/FREE Full Text
  73. ↵
    1. Woldorff MG,
    2. Gallen CC,
    3. Hampson SA,
    4. Hillyard SA,
    5. Pantev C,
    6. Sobel D,
    7. Bloom FE
    (1993) Modulation of early sensory processing in human auditory cortex during auditory selective attention. Proc Natl Acad Sci U S A 90:8722–8726.
    OpenUrlAbstract/FREE Full Text
  74. ↵
    1. Wood N,
    2. Cowan N
    (1995) The cocktail party phenomenon revisited: how frequent are attention shifts to one's name in an irrelevant auditory channel? J Exp Psychol Learn Mem Cogn 21:255–260.
    OpenUrlCrossRefPubMed
  75. ↵
    1. Zion Golumbic EM,
    2. Poeppel D,
    3. Schroeder CE
    (2012) Temporal context in speech processing and attentional stream selection: a behavioral and neural perspective. Brain Lang 122:151–161.
    OpenUrlCrossRefPubMed