The conditions of everyday life are such that people often hear speech that has been degraded (e.g., by background noise or electronic transmission) or listen while they are distracted by other tasks. However, it remains unclear what role attention plays in processing speech that is difficult to understand. In the current study, we used functional magnetic resonance imaging to assess the degree to which spoken sentences were processed under distraction, and whether this depended on the acoustic quality (intelligibility) of the speech. On every trial, adult human participants attended to one of three simultaneously presented stimuli: a sentence (at one of four acoustic clarity levels), an auditory distracter, or a visual distracter. A postscan recognition test showed that clear speech was processed even when not attended, but that attention greatly enhanced the processing of degraded speech. Furthermore, speech-sensitive cortex could be parcellated according to how speech-evoked responses were modulated by attention. Responses in auditory cortex and areas along the superior temporal sulcus (STS) took the same form regardless of attention, although responses to distorted speech in portions of both posterior and anterior STS were enhanced under directed attention. In contrast, frontal regions, including left inferior frontal gyrus, were engaged only when listeners were attending to speech, and these regions exhibited elevated responses to degraded, compared with clear, speech. We suggest this response is a neural marker of effortful listening. Together, our results suggest that attention enhances the processing of degraded speech by engaging higher-order mechanisms that modulate perceptual auditory processing.
Conversations in everyday life are often made more challenging by poor listening conditions that degrade speech (e.g., electronic transmission, background noise) or by tasks that distract us from our conversational partner. Research exploring how we perceive degraded speech typically considers situations in which speech is the sole (or target) signal, and not how distraction may influence speech processing (Miller et al., 1951; Kalikow et al., 1977; Pichora-Fuller et al., 1995; Davis et al., 2005). Attention may play a critical role in processing speech that is difficult to understand.
It has been hypothesized that perceiving degraded speech consumes more attentional resources than does clear speech (Rabbitt, 1968, 1990). This “effortful listening” hypothesis is usually tested indirectly by showing that attending to degraded (compared with clear) speech interferes with downstream cognitive processing, such as encoding words into memory (Rabbitt, 1990; Murphy et al., 2000; Stewart and Wingfield, 2009). Here, we examine how distraction (compared with full attention) affects the processing of spoken sentences: if processing degraded speech requires more attentional resources than clear speech, then distraction should interfere more with the processing of degraded speech. Using functional magnetic resonance imaging (fMRI), we tested this hypothesis by directly comparing neural responses to degraded and clear sentences when these stimuli are attended or unattended.
Under directed attention, spoken sentence comprehension activates a distributed network of brain areas involving left frontal and bilateral temporal cortex (Davis and Johnsrude, 2003; Davis et al., 2007; Hickok and Poeppel, 2007; Obleser et al., 2007). This speech-sensitive cortex is arranged in a functional hierarchy: cortically early regions (e.g., primary auditory cortex) are sensitive to the acoustic form of speech, whereas activity in higher-order temporal and frontal regions correlates with speech intelligibility regardless of acoustic characteristics (Davis and Johnsrude, 2003), suggesting that these areas contribute to the processing of more abstract linguistic information. Frontal and periauditory regions, which respond more actively to degraded, compared with clear, speech, have been proposed to compensate for distortion (Davis and Johnsrude, 2003, 2007; Shahin et al., 2009; Wild et al., 2012). We expected that attention would selectively modulate speech-evoked responses in these higher order areas, because lower-level periauditory responses to speech do not seem to depend on attentional state (Heinrich et al., 2011).
In the present study, we use fMRI to compare how sentences of varying acoustic clarity—and hence, intelligibility—are processed when attended, or ignored in favor of engaging distracter tasks. We also use a recognition memory posttest to assess how well sentences from the scanning session are processed as a function of stimulus clarity and attentional state. This factorial design, with intelligibility and attentional task as main effects, allows us to identify regions that are not only sensitive to differences in speech intelligibility or attentional focus, but, critically, areas where the processing of speech depends on attention (i.e., the interaction). Elevated responses to degraded speech that occur only when attention is directed toward speech would suggest a neural signature of effortful listening.
Materials and Methods
We tested 21 undergraduate students (13 females) between 19 and 27 years of age (mean, 21 years; SD, 3.0 years) from Queen's University (Ontario, Canada). All participants were recruited through poster advertisement and the Queen's Psychology 100 Subject Pool. A separate group of 13 undergraduate students (11 females, 18–35 years of age) were tested to pilot the materials and the procedure. They underwent the same experimental protocol as the other participants, including the presentation of all three stimulus sources, in an isolated soundbooth.
All subjects were right-handed native speakers of English, with normal self-reported hearing, normal or corrected-to-normal vision, and no known attentional or language processing impairments. Participants reported no history of seizures or psychiatric or neurological disorders, and no current use of any psychoactive medications. Participants also complied with magnetic resonance imaging safety standards: they reported no prior surgeries involving metallic implants, devices, or objects. This study was cleared by the Health Sciences and Affiliated Teaching Hospitals Research Ethics Board (Kingston, ON, Canada), and written informed consent was received from all subjects.
To avoid acoustic confounds associated with continuous echoplanar imaging, fMRI scanning was conducted using a sparse imaging design (Edmister et al., 1999; Hall et al., 1999) in which stimuli were presented in the 7 s silent gap between successive 2 s volume acquisitions. On every trial, subjects were cued to attend to one of three simultaneously presented stimuli (Fig. 1)—a sentence [speech stimulus (SP)], an auditory distracter (AD), or a visual distracter (VD)—and performed a decision task associated with the attended stimulus. The speech stimulus on every trial was presented at one of four levels of clarity (for details of stimulus creation, see Speech stimuli, below). Together, these yielded a factorial design with 12 conditions (4 speech intelligibility levels × 3 attention conditions). A silent baseline condition was also included: participants simply viewed a fixation cross, and no other stimuli were presented.
Sentence stimuli consisted of 216 meaningful English sentences (e.g., “His handwriting was very difficult to read”) recorded by a female native speaker of North American English in a soundproof chamber using an AKG C1000S microphone with an RME Fireface 400 audio interface (sampling at 16 bits, 44.1 kHz). We manipulated speech clarity, and hence intelligibility, using a noise-vocoding technique (Shannon et al., 1995) that preserves the temporal information in the speech envelope but reduces spectral detail. Noise-vocoded (NV) stimuli were created by filtering each audio recording into contiguous, approximately logarithmically spaced frequency bands (selected to be approximately equally spaced along the basilar membrane) (Greenwood, 1990). Filtering was performed using finite impulse response Hann bandpass filters with a window length of 801 samples. The amplitude envelope from each frequency band was extracted by full-wave rectifying the band-limited signal and applying a low-pass filter (30 Hz cutoff, using a fourth-order Butterworth filter). Each envelope was then applied to bandpass-filtered noise of the same frequency range, and all bands were recombined to produce the final NV utterance.
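The vocoding steps just described (band splitting, envelope extraction, noise excitation, recombination) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the Greenwood map constants and corner frequencies are our assumptions, while the 801-tap Hann FIR filters, 30 Hz fourth-order Butterworth envelope smoothing, and RMS matching follow the parameters given above.

```python
import numpy as np
from scipy.signal import butter, filtfilt, firwin, lfilter

def greenwood_edges(n_bands, lo=50.0, hi=7000.0):
    """Band edges approximately equally spaced along the basilar membrane
    (Greenwood, 1990): f = 165.4 * (10**(2.1 * x) - 1)."""
    def f_to_x(f):  # cochlear position for frequency f
        return np.log10(f / 165.4 + 1.0) / 2.1
    xs = np.linspace(f_to_x(lo), f_to_x(hi), n_bands + 1)
    return 165.4 * (10.0 ** (2.1 * xs) - 1.0)

def extract_envelope(band, fs, cutoff=30.0):
    """Full-wave rectify, then low-pass with a 4th-order Butterworth filter."""
    b, a = butter(4, cutoff, fs=fs)
    return np.maximum(filtfilt(b, a, np.abs(band)), 0.0)

def noise_vocode(x, fs, n_bands=6, seed=0):
    """Filter speech and a noise carrier into the same bands (801-tap Hann
    FIR), impose each band's speech envelope on the corresponding noise
    band, sum the bands, and match total RMS to the input."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(x))
    edges = greenwood_edges(n_bands)
    out = np.zeros(len(x))
    for lo_f, hi_f in zip(edges[:-1], edges[1:]):
        taps = firwin(801, [lo_f, hi_f], fs=fs, pass_zero=False, window="hann")
        env = extract_envelope(lfilter(taps, [1.0], x), fs)
        out += env * lfilter(taps, [1.0], noise)
    return out * np.sqrt(np.mean(x ** 2) / np.mean(out ** 2))
```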
With this process, we created four levels of speech varying in acoustic clarity (Fig. 2): clear speech, which is easily understood and highly intelligible; six-band NV stimuli (NV-hi), which are spectrally degraded but still quite intelligible; compressed six-band NV stimuli (NV-lo), which are more difficult to understand than regular six-band NV stimuli; and spectrally rotated NV (rNV) stimuli, which are acoustically very similar to NV stimuli, but impossible to understand. NV-lo stimuli differ from NV-hi items in that their channel envelopes were amplitude-compressed (by taking the square root) to reduce dynamic range before applying them to the noise carriers. To create rNV items, the envelope from the lowest frequency band was applied to the highest frequency noise band (and vice versa), the envelope from the second lowest band was applied to the second highest band (and vice versa), and envelopes from the inner two bands were swapped. Spectrally rotated speech is completely unintelligible but retains the same overall temporal profile and spectral complexity as the nonrotated version, and hence serves as a closely matched control. After processing, all stimuli (864 audio files) were normalized to have the same total root mean square power.
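The two manipulations that distinguish NV-lo and rNV stimuli both operate on the channel envelopes and can be sketched minimally as below (our own function and variable names; envelopes are assumed to be stored as a bands × samples array with the lowest band first).

```python
import numpy as np

def compress_envelopes(envs):
    """NV-lo: square-root compression of each channel envelope, reducing
    its dynamic range before it is applied to the noise carrier."""
    return np.sqrt(envs)

def rotate_envelopes(envs):
    """rNV: swap envelopes end for end across bands, so the lowest band's
    envelope drives the highest noise band (and vice versa), the second
    lowest drives the second highest, and the inner pair are swapped."""
    return envs[::-1]
```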
Twelve sets of 18 sentences were constructed from the corpus of 216 items. The sets were statistically matched for number of words (mean = 9.0, SD = 2.2), number of syllables (mean = 20.1, SD = 7.3), length in milliseconds (mean = 2499, SD = 602.8), and the logarithm of the sum word frequency (Thorndike and Lorge written frequency, mean = 5.5, SD = 0.2). Each set of sentences was assigned to one of the 12 experimental conditions, such that sets and conditions were counterbalanced across subjects to eliminate item-specific effects.
The pilot participants, when instructed to attend to the speech stimulus, repeated back as much of the sentence as they could, which was scored to give a percentage of words correct measure of attended speech intelligibility. A repeated-measures ANOVA on the average proportion of words reported correctly showed a significant main effect of speech type (F(3,36) = 451.21, p < 0.001; Fig. 2), and post hoc tests (Bonferroni corrected for multiple comparisons) showed that clear speech was reported more accurately than NV-hi (t(12) = 5.38, p < 0.001), which was reported more accurately than NV-lo (t(12) = 5.55, p < 0.001), which was reported more accurately than rNV speech (t(12) = 20.38, p < 0.001).
Auditory distracters were sequences of 400 ms narrow-band ramped noise bursts separated by a variable amount of silence (220–380 ms). The number of sounds in each sequence was selected so that the durations of the auditory distracter and the sentence stimulus were approximately equal (Fig. 1). Each noise burst was created by passing 400 ms of broadband white noise through a filter with a fixed bandwidth of 1000 Hz and a center frequency that was randomly selected to be between 4500 and 5500 Hz. The noise bursts were amplitude-modulated to create linear onsets of 380 ms and sharp linear offsets of 20 ms. Target sounds in this stream possessed a sharp onset (20 ms) and a long offset (380 ms) (Fig. 1). Half of all experimental trials were selected to contain a single target sound, which never occurred first in the sequence of noise bursts.
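A single distracter burst with the parameters given above could be generated along the following lines (an illustrative reconstruction, not the authors' stimulus code; the Butterworth filter type and order are our assumptions, while the durations, bandwidth, and centre-frequency range follow the text).

```python
import numpy as np
from scipy.signal import butter, lfilter

def noise_burst(fs, target=False, rng=None):
    """One 400 ms burst: white noise band-passed to a fixed 1000 Hz band
    around a random 4500-5500 Hz centre frequency, with a linear 380 ms
    onset and 20 ms offset; target bursts have the ramp reversed (sharp
    20 ms onset, long 380 ms offset)."""
    rng = rng if rng is not None else np.random.default_rng()
    n = int(0.4 * fs)
    fc = rng.uniform(4500.0, 5500.0)
    b, a = butter(4, [fc - 500.0, fc + 500.0], btype="band", fs=fs)
    noise = lfilter(b, a, rng.standard_normal(n))
    n_on = int(0.38 * fs)
    ramp = np.concatenate([np.linspace(0.0, 1.0, n_on),
                           np.linspace(1.0, 0.0, n - n_on)])
    return noise * (ramp[::-1] if target else ramp)
```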
Auditory stimuli (distracter sequences and speech stimuli) were presented diotically over MR-compatible high-fidelity electrostatic earphones, placed in ear defenders that attenuated the background sound of the scanner by ∼30 dB (NordicNeurolab AudioSystem).
Data from the auditory distracter task were analyzed using signal detection theory by comparing the z-score of the proportion of hits to the z-score of the proportion of false alarms, yielding a d′ score for each participant. For the pilot group, the average d′ score was 2.30 (SD = 0.88), which was significantly greater than chance (d′ > 0; t(12) = 9.51, p < 0.001), indicating that participants were able to perform the target detection task.
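The sensitivity computation described here is simply z(hit rate) minus z(false-alarm rate). A minimal sketch follows; the clamping of perfect rates away from 0 and 1 is our assumption (the text does not state a correction), since the inverse-normal transform is undefined at those extremes.

```python
from scipy.stats import norm

def d_prime(hits, misses, false_alarms, correct_rejections):
    """d' = z(hit rate) - z(false-alarm rate), with extreme rates clamped
    into (0, 1) so norm.ppf (the inverse normal CDF) is defined."""
    def rate(k, n):
        return min(max(k / n, 0.5 / n), 1.0 - 0.5 / n)
    h = rate(hits, hits + misses)
    f = rate(false_alarms, false_alarms + correct_rejections)
    return norm.ppf(h) - norm.ppf(f)
```

Equal hit and false-alarm rates give d′ = 0 (chance), and d′ grows as hits outnumber false alarms.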
The visual distracters were series of cross-hatched white ellipses presented on a black background (Fig. 1), length-matched to the duration of the speech stimulus on every trial. These visual stimuli have been shown to be effective distracters in other experiments manipulating focus of attention (Carlyon et al., 2003). Every 200 ms, a new ellipse, which randomly varied in terms of horizontal and vertical scaling factors and could be reflected in the vertical or horizontal axis, was presented (Fig. 1). Half of all trials (balanced across experimental conditions) were selected to contain a visual target: an ellipse with dashed, instead of solid, lines. If present in a trial, the visual target would appear within ±1 s of the midpoint of the speech stimulus. Visual stimuli were displayed by a video projector on to a rear-projection screen viewed by participants through an angled mirror mounted on the head coil.
Again, data were analyzed with signal detection theory to give a d′ score for each participant. For the pilot group, the average d′ score was 3.89 (SD = 0.35), which was significantly greater than chance levels (i.e., d′ > 0; t(12) = 39.91, p < 0.001), indicating that participants were able to perform the target detection task.
On each trial, participants were cued to attend to a single stimulus stream with a visual prompt presented during the scan of the previous trial (Fig. 1). The cue word “Speech” instructed participants to attend to the speech stimulus, “Chirps” cued the participants to the auditory distracter, and “Footballs” cued the visual sequence.
When cued to attend to the speech stimulus, participants listened and indicated at the end of the trial whether or not they understood the gist of the sentence (with a two-alternative, yes/no keypress), providing a measure of the intelligibility of the attended speech. When cued to attend to the visual or auditory distracter, participants monitored the stream for a single target stimulus and indicated at the end of the trial whether or not the target was present (with a two-alternative, yes/no keypress). Subjects were instructed to press either button at the end of each silent trial. A response window of 1.5 s (prompted by the word “Respond”) occurred before the onset of the image acquisition period (Fig. 1). Participants made their responses with a button wand held in their right hand, using the index finger button for “yes” and the middle finger button for “no.”
Participants experienced 18 trials of each of the 12 experimental conditions (4 speech types × 3 attention tasks) and 10 trials of the silent baseline (226 trials total). The 226 trials were divided into four blocks of 56 or 57 trials, each with approximately the same number of trials from each condition. Two extra images were added to the start of each block to allow the magnetization to reach a steady state; these dummy images were discarded from all preprocessing and analysis steps. We implemented an event-related design such that participants did not know which task they would perform on any given trial until a cue appeared. However, we reduced task switching to make the experiment easier on participants by constraining the number of consecutive trials with the same task. The distribution was approximately Poisson shaped, such that it was most likely for there to be at least two trials in a row with the same task, but never more than six in a row. Despite the pseudorandomized distribution of tasks, the speech stimulus on every trial was fully randomized and silent trials were fully interspersed throughout the experiment.
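A trial order with these run-length constraints could be generated along the following lines (our reconstruction, not the authors' scripts; the Poisson rate parameter is an assumption chosen so that runs of two or more trials with the same task are most likely, with runs truncated at six).

```python
import numpy as np

def task_sequence(n_trials, tasks=("Speech", "Chirps", "Footballs"),
                  lam=2.0, max_run=6, seed=0):
    """Build a trial order in which the same task repeats for a run of
    1-6 trials, with run lengths drawn from a truncated Poisson
    distribution, before switching to a different task."""
    rng = np.random.default_rng(seed)
    seq, prev = [], None
    while len(seq) < n_trials:
        options = [t for t in tasks if t != prev]
        task = options[rng.integers(len(options))]
        run = int(np.clip(rng.poisson(lam), 1, max_run))
        seq.extend([task] * run)
        prev = task
    return seq[:n_trials]
```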
All participants (including pilot subjects who generated the performance data reported in Participants, above) underwent extensive training on all three tasks—on each task individually and with task switching. Because the intelligibility of NV speech depends on experience, all participants were also trained with six-band NV stimuli before the study to ensure that intelligibility of the NV-hi and NV-lo stimuli were asymptotic before beginning the actual experiment (Davis et al., 2005).
Immediately after the scanning session, subjects performed a surprise recognition task. This posttest measured participants' memory for a subset (50%) of sentences they heard during the experiment (half of the sentences from each condition randomly selected for each participant). An additional 55 new foil sentences (recorded by the same female speaker) were intermixed with the 108 target sentences. Subjects made an old/new discrimination for each stimulus, responding via button press. Sensitivity (d′) for each condition was determined by comparing the z-score of the proportion of hits in each condition (out of a maximum of 9) to the z-score of the proportion of false alarms for all foils. These scores were analyzed using two-factor (speech type × attention task) repeated-measures ANOVA.
fMRI protocol and data acquisition.
Imaging was performed on the 3.0 tesla Siemens Trio MRI system in the Queen's Centre for Neuroscience Studies MR Facility (Kingston, Ontario, Canada). T2*-weighted functional images were acquired using GE-EPI sequences (field of view, 211 mm × 211 mm; in-plane resolution, 3.3 mm × 3.3 mm; slice thickness, 3.3 mm with a 25% gap; TA, 2000 ms per volume; TR, 9000 ms; TE, 30 ms; flip angle, 78°). Acquisition was transverse oblique, angled away from the eyes, and in most cases covered the whole brain (in a very few cases, slice positioning excluded the top of the superior parietal lobule). Each stimulus sequence was positioned in the silent interval such that the middle of the sequence occurred 4 s before the onset of the next scan (Fig. 1). In addition to functional data, a whole-brain 3D T1-weighted anatomical image (voxel resolution, 1.0 mm3) was acquired for each participant at the start of the session.
fMRI data preprocessing.
fMRI data were processed and analyzed using Statistical Parametric Mapping (SPM8; Wellcome Centre for Neuroimaging, London, UK). Data preprocessing steps for each subject included: (1) rigid realignment of each EPI volume to the first of the session; (2) coregistration of the structural image to the mean EPI; (3) normalization of the structural image to common subject space (with a subsequent affine registration to MNI space) using the group-wise DARTEL registration method included with SPM8 (Ashburner, 2007); and (4) warping of all functional volumes using deformation flow fields generated from the normalization step, which simultaneously resampled the images to isotropic 3 mm voxels and spatially smoothed them with a three-dimensional Gaussian kernel with a full-width at half-maximum of 8 mm. Application of this smoothing kernel resulted in an estimated smoothness of ∼15 mm in the group analyses.
Analysis of each participant's data was conducted using a general linear model in which each scan was coded as belonging to one of 13 conditions. The four runs were modeled as one session within the design matrix, and four regressors were used to remove the mean signal from each of the runs. Six realignment parameters were included to account for movement-related effects (i.e., three degrees of freedom for translational movement in the x, y, and z directions, and three degrees of freedom for rotational motion: yaw, pitch, and roll). Two additional regressors coded the presence of a target in the visual and auditory distracter streams. Button presses were not modeled because a button was pressed on every trial. Due to the long TR of this sparse-imaging paradigm, no correction for serial autocorrelation was necessary. A high-pass filter with a cutoff of 216 s was modeled to eliminate low-frequency signal confounds such as scanner drift. These models were then fitted using a least-mean-squares method to each individual's data, and parameter estimates were obtained. Contrast images for each of the 12 experimental conditions were generated by comparing each of the condition parameter estimates (i.e., 12 betas) to the silent baseline condition. These images were primarily used to obtain plots of estimated signal within voxels for each condition.
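The per-subject model amounts to ordinary least-squares estimation of a design matrix against each voxel's time series, followed by contrasts on the resulting parameter estimates. A minimal sketch of that logic (not SPM's implementation, which additionally handles filtering, masking, and scaling):

```python
import numpy as np

def fit_glm(Y, X):
    """Least-squares fit of design matrix X (scans x regressors) to data
    Y (scans x voxels); returns betas (regressors x voxels)."""
    betas, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return betas

def contrast(betas, c):
    """Weighted combination of betas (e.g., a condition regressor minus
    baseline), evaluated at every voxel."""
    return np.asarray(c) @ betas
```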
The group-level analysis was conducted using a 4 (Speech Type) × 3 (Attentional Task) factorial partitioned-error repeated-measures ANOVA, in which separate models were constructed for each main effect and for the interaction of the two factors (Henson and Penny, 2003). For whole-brain analyses of the main effects and their interaction, we used a voxelwise threshold of p < 0.05, corrected for multiple comparisons over the whole brain using a nonparametric permutation test as implemented in SnPM (www.sph.umich.edu/ni-stat/SnPM) (Nichols and Holmes, 2002). This test has been shown to have strong control over experiment-wise type I error (Holmes et al., 1996).
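The familywise error control behind this correction is a maximum-statistic permutation test: under the null hypothesis, each subject's contrast sign is exchangeable, and the maximum statistic over all voxels is recorded for each relabeling. A sketch of that idea for a one-sample test (our illustration of the general logic, not the SnPM implementation):

```python
import numpy as np

def max_stat_threshold(data, n_perm=1000, alpha=0.05, seed=0):
    """data: (subjects, voxels) array of contrast values. Flip each
    subject's sign at random, recompute voxelwise one-sample t statistics,
    and record the maximum |t| per permutation; the (1 - alpha) quantile
    of those maxima is an FWE-corrected voxelwise threshold."""
    rng = np.random.default_rng(seed)
    n_sub = data.shape[0]
    def tstat(d):
        return d.mean(0) / (d.std(0, ddof=1) / np.sqrt(n_sub))
    maxima = np.empty(n_perm)
    for i in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=(n_sub, 1))
        maxima[i] = np.abs(tstat(data * signs)).max()
    return np.quantile(maxima, 1.0 - alpha)
```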
A significant main effect or interaction in an ANOVA can be driven by many possible simple effects. In our study, for example, a main effect of speech type might mean that activity correlates with intelligibility (i.e., high activity for clear speech, intermediate activity for degraded speech, and low activity for unintelligible speech), that activity is increased for degraded compared with clear speech, or that there is some other difference in activity between the four levels of the speech type factor. Therefore, the thresholded F-statistic images showing overall main effects (and interaction) were parsed into simple effects by inclusively masking with specific t-contrast images (i.e., simple effects) that were thresholded at p < 0.001, uncorrected. The t-contrasts were combined to determine logical intersections of the simple effects; in this way, significant voxels revealed by F-contrasts were labeled as being driven by one or more simple effects. Peaks were localized using the LONI probabilistic brain atlas (LPBA40) (Shattuck et al., 2008) and confirmed by visual inspection of the average structural image. Results of the fMRI analysis are shown on the average normalized T1-weighted structural image.
Due to technical difficulties with the stimulus-delivery and response-collection computer program, behavioral and fMRI data were unavailable for two subjects. Analyses of fMRI data, and behavioral data obtained during scanning, are based on the remaining 19 subjects. Posttest data were unavailable for one subject, and so the results of this test are based on data from 20 subjects.
When attending to speech stimuli, participants indicated on every trial whether or not they understood the gist of the sentence. A one-way repeated-measures ANOVA of the proportion of sentences understood, treating speech type as a four-level within-subjects factor, demonstrated a significant main effect of speech type (F(3,54) = 275.34, p < 0.001; Fig. 2). Post hoc pairwise comparisons, corrected for multiple comparisons (Sidak, 1967), indicated that these subjective reports of intelligibility did not reliably differ between clear speech and NV-hi items. NV-hi sentences were reported as understood significantly more often than NV-lo (t(18) = 5.90, p < 0.001), which were reported as understood significantly more often than rNV sentences (t(18) = 13.70, p < 0.001). These results closely matched the intelligibility data collected from the pilot group (Fig. 2).
Mean sensitivities (i.e., d′) for the auditory and visual distracter tasks were 2.15 (SD = 1.30) and 3.17 (SD = 0.55), respectively. Both scores were significantly greater than chance levels (t(18) = 7.22, p < 0.001; t(18) = 25.02, p < 0.001), suggesting that participants in the scanner were attending to the correct stimulus stream. A pairwise comparison showed that the auditory distraction task was significantly more challenging than the visual task (t(18) = 7.22, p < 0.005).
The d′ scores were also broken down by speech type to see whether the unattended speech stimulus affected performance in the distracter tasks. ANOVAs showed no significant effect of the unattended speech type on target detection performance for the auditory or visual distracter tasks.
Old/new discrimination posttest
Results of the posttest are shown in Figure 3. There was a significant main effect of Speech Type (F(3,57) = 29.12, p < 0.001), with a pattern similar to the pilot subjects, where recognition scores for Clear and NV-hi sentences were not reliably different, but recognition for NV-hi items was significantly better than for NV-lo (t(19) = 4.48, p < 0.001), which was significantly greater than recognition of rNV sentences (t(19) = 3.50, p < 0.005). There was also a significant main effect of Attention (F(2,38) = 11.25, p < 0.001), such that d′ values were significantly higher for attended sentences compared with those presented when attention was elsewhere (SP > AD: t(19) = 4.23, p < 0.001; SP > VD: t(19) = 3.74, p < 0.001). We note that memory for sentences presented during the distraction tasks did not differ significantly. Importantly, there was a significant Speech Type × Attention interaction (F(6,114) = 3.61, p < 0.005), where pairwise (Sidak-corrected) comparisons showed that recognition of degraded sentences (i.e., NV-hi and NV-lo) was significantly enhanced by attention to the speech stimulus (NV-hi SP > AD: t(19) = 3.27, p < 0.005; NV-hi SP > VD: t(19) = 3.93, p < 0.001; NV-lo SP > AD: t(19) = 3.46, p < 0.005; NV-lo SP > VD: t(19) = 3.64, p < 0.005), whereas attention had no effect on the recognition of clear speech or rotated NV speech items (Fig. 3A).
The Speech Type × Attention interaction can also be characterized by comparing recognition of the (potentially intelligible) noise-vocoded items with recognition of clear speech across attentional tasks. For attended speech, recognition of clear speech sentences did not differ from NV-hi or NV-lo. However, when attention was directed toward the auditory distracter, clear sentences were remembered significantly better than NV-lo (t(19) = 4.76, p < 0.001), but not NV-hi, and when attention was directed toward the visual distracter, recognition of clear sentences was better than both NV-hi (t(19) = 3.63, p < 0.01) and NV-lo (t(19) = 4.15, p < 0.005).
One-sample t tests were conducted on d′ scores for each condition (12 per group) to determine whether recognition of sentences presented in those conditions was greater than chance (i.e., d′ > 0). Performance was significantly better than chance (d′ > 0; p < 0.05, Bonferroni-corrected for 12 comparisons) for all clear and high-intelligibility NV speech conditions and for attended NV-lo items. Recognition of unattended NV-lo items did not differ from chance. The unintelligible rNV stimuli were never recognized above chance levels.
Main effect of speech type
The contrast assessing the main effect of speech type revealed activation of left inferior frontal gyrus (LIFG) and large bilateral activations of the temporal cortex, ranging along the full length of the superior temporal gyrus (STG), superior temporal sulcus (STS), and the superior temporal plane (Fig. 4; Table 1). There are many ways in which four speech-type conditions can differ from each other, but we were interested in two specific patterns of difference, which we tested with specific t contrasts. First, we searched for areas where blood oxygen level-dependent (BOLD) signal correlated with speech intelligibility scores (i.e., an intelligibility-related response). Intelligibility scores collected from the pilot subjects were used to construct this contrast because they provided a more objective and continuous measure than the binary subjective response made by participants in the scanner (Davis and Johnsrude, 2003). The pilot and in-MR measures were highly correlated with each other (Fig. 2). Second, the contrast (NV-hi + NV-lo)/2 > Clear was used to identify regions showing a noise-elevated response, that is, regions more responsive to degraded than clear speech, which might therefore be involved in compensating for acoustic degradation. The unintelligible rNV stimuli were not included in this contrast (i.e., weighted with a zero), because it is not clear whether listeners would try hard to understand them or simply give up. Responses within bilateral temporal and inferior frontal regions demonstrated a significant correlation with intelligibility (Fig. 4, dark blue voxels), largely consistent with a previous correlational intelligibility analysis (Davis and Johnsrude, 2003). Noise-elevated responses were found in left premotor cortex (Fig. 4, top left) and bilateral insular cortex. These did not overlap with any regions that demonstrated a correlation with intelligibility.
Interestingly, a noise-elevated response was not observed in left inferior frontal cortex as might have been expected from previous findings (Davis et al., 2003; Giraud et al., 2004; Shahin et al., 2009). The lack of a noise-elevated response in LIFG collapsed across attention conditions is due to a strong interaction with attention, as we discuss below.
Main effect of attention task
The contrast assessing the main effect of attention condition revealed widespread activity (Fig. 5; Table 2). We tested for two simple effects: regions where attention to an auditory stimulus resulted in enhanced responses compared with attention to the visual stimulus [(SP + AD) > 2VD] and areas that demonstrated the opposite pattern [2VD > (SP + AD)]. In accordance with previous research, we observed that attention modulated activity in sensory cortices, such that responses in the sensory cortex corresponding to the modality of the attended stimulus were enhanced (Heinze et al., 1994; Petkov et al., 2004; Johnson and Zatorre, 2005, 2006; Heinrich et al., 2011). This confirmed that our attentional manipulation was effective.
Interaction (Speech Type × Attention Task)
Most interesting were areas that demonstrated an interaction between Speech Type and Attention; that is, areas in which the relationship between acoustic quality of sentences and BOLD signal depended on the listeners' attentional state. Several clusters of voxels demonstrated a significant interaction (Fig. 6, bright blue voxels; Table 3). These included bilateral posterior STS/STG, left anterior STS/STG, the LIFG (specifically partes triangularis and opercularis), bilateral angular gyri, bilateral anterior insulae, left supplementary motor area (SMA), and the caudate. Interestingly, areas of the STG corresponding to primary auditory cortex and much of the superior temporal plane showed no evidence of an interaction, even at a threshold of p < 0.05, uncorrected for multiple comparisons (Fig. 6, red voxels).
Attention influences speech processing differently in frontal and temporal cortex
It is possible that speech-evoked responses in these areas are modulated by attention to different extents or in different ways. Such a difference would manifest as a three-way (Region × Speech Type × Attention) interaction. To quantitatively compare the interaction patterns in temporal and frontal cortices, we conducted three-way repeated-measures ANOVAs on the parameter estimates extracted from four areas: left anterior STS, left posterior STS, right posterior STS, and LIFG. A single LIFG response was created by averaging the parameter estimates from the two LIFG peaks listed in Table 3 (they were within 15 mm of each other, which, given the effective smoothness of the data, is an unresolvable difference). With all four regions entered into the model, a significant three-way interaction (F(18,324) = 2.13, p < 0.01) indicated that the Speech Type × Attention interaction patterns truly differed among these regions. Follow-up comparisons were performed using three-way ANOVAs (Region × Speech Type × Attention) on two regions at a time. Interactions from the three temporal peaks (left anterior and bilateral posterior STS) were not reliably different, but they all differed significantly from the LIFG (left anterior STS vs LIFG: F(6,108) = 2.52, p < 0.05; left posterior STS vs LIFG: F(6,108) = 4.33, p < 0.001; right posterior STS vs LIFG: F(6,108) = 2.83, p < 0.05). Given the lack of difference among them, the three temporal peaks were averaged to create a single STS response, which differed significantly from the LIFG interaction pattern (F(6,108) = 4.15, p < 0.005). The distinct interaction profiles in LIFG and in temporal cortex are illustrated in Figure 6, b and c.
Characterizing the influence of attention on speech processing in frontal and temporal cortex
It can be seen that, for both LIFG and STS, the two-way (Speech Type × Attention) interaction is due at least in part to elevated signal for degraded speech when it is attended, compared with when it is not (Fig. 6b,c). This was confirmed by pairwise comparisons showing that NV-hi sentences evoked significantly greater activity when they were attended than when they were ignored (Table 4; Fig. 6b,c). Also, in both regions, rNV stimuli elicited greater activity when attention was directed toward them, or toward the auditory distracter, than when attention was directed toward the visual stimulus (Table 4; Fig. 6b,c).
Despite this common enhancement of activity for degraded speech that was attended, overall speech-evoked responses differed in the LIFG and STS (as evidenced by the significant three-way interaction). To quantify these differences, we compared these responses between areas with two-way ANOVAs (Region × Speech Type). Attended speech elicited a significantly different pattern of activity in LIFG than in the STS (F(3,54) = 6.12, p < 0.001). A post hoc contrast that compared degraded speech (NV-hi and NV-lo) against clear speech revealed a significant noise-elevated response in the LIFG (F(1,18) = 15.51, p < 0.001; Fig. 6b) but not in the STS (Fig. 6c). Responses to unattended speech also differed significantly between these areas, as demonstrated by significant Region × Speech Type (2 × 4) interactions at both levels of distraction (Auditory Distracter: F(3,54) = 29.61, p < 0.001; Visual Distracter: F(3,54) = 17.84, p < 0.001). Linear interaction contrasts showed that unattended speech produced a steeper linear response (i.e., decreasing activity with decreasing intelligibility) in the STS than in the LIFG (Auditory Distracter: F(1,18) = 74.95, p < 0.001; Visual Distracter: F(1,18) = 39.84, p < 0.001). These results can be observed in Figure 6, b and c: in the STS, activity elicited by unattended speech decreases with intelligibility, whereas this pattern is less apparent in the LIFG. Although STS regions are significantly active (relative to rest) regardless of attention condition, responses in the LIFG region are above baseline only when attention is on the speech stimulus. Furthermore, the interaction pattern in LIFG explains why this area did not show a noise-elevated response for the main effect of speech type (Fig. 4): this response was present only for speech that was attended.
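The logic of a linear interaction contrast of the kind reported here can be illustrated as follows. This is a sketch on simulated numbers, not the study's data: each subject receives a linear-trend score (weights 3, 1, -1, -3 across decreasing intelligibility), and a paired test on the difference in trend scores between the two regions corresponds to the Region × linear-trend interaction, whose squared t statistic is an F with df (1, 18), the form reported in the text.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(1)
n_sub = 19  # number of listeners implied by the reported df (1, 18)

# Placeholder per-subject parameter estimates for the four speech types,
# ordered by decreasing intelligibility: clear, NV-hi, NV-lo, rNV.
sts = rng.normal(size=(n_sub, 4))
lifg = rng.normal(size=(n_sub, 4))

# Linear-trend weights over the four intelligibility levels.
weights = np.array([3.0, 1.0, -1.0, -3.0])

# Per-subject linear-trend scores in each region.
trend_sts = sts @ weights
trend_lifg = lifg @ weights

# Paired comparison of trend scores = Region x linear-trend interaction;
# under the null, t**2 follows an F distribution with df (1, n_sub - 1).
t, p = ttest_rel(trend_sts, trend_lifg)
print(f"F(1,{n_sub - 1}) = {t**2:.2f}, p = {p:.3f}")
```

A steeper (more positive) trend score in STS than LIFG is exactly the pattern the contrast above would detect.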
Notably, a noise-elevated response that emerged only for attended speech can also be observed qualitatively in other brain regions that demonstrated a significant interaction. Figure 7 depicts the interaction patterns from left SMA (Fig. 7a), left insula (Fig. 7b), right caudate (Fig. 7c), and right angular gyrus (Fig. 7d). Again, attended speech elicited a noise-elevated response that was absent when attention was focused elsewhere.
Attention-dependent speech processing occurs on the upper bank of the STS
Finally, we wished to improve our localization of the temporal lobe activations revealed in the interaction analysis. It has been observed from anatomical studies of rhesus monkeys that the STS is a large and heterogeneous area of cortex, containing several distinct areas that can be parcellated according to their cytoarchitectonic and myeloarchitectonic properties, as well as their afferent and efferent connectivity (Seltzer and Pandya, 1978, 1989, 1991; Padberg et al., 2003). These include unisensory regions—auditory area TAa, along the upper bank and lip of the STS, and visual areas TEa and TEm, along the lower bank of the sulcus—and polymodal regions TPO and PGa, which lie along the upper bank and in the depth of the sulcus (Seltzer and Pandya, 1978). Area TPO itself is composed of three distinct subdivisions (TPOc, TPOi, and TPOr), which receive inputs of varying strength from frontal (ventral and prearcuate cortex) and parietal regions (Padberg et al., 2003). Precise localization of the STS peaks could therefore provide important functional information (e.g., auditory vs visual vs multisensory processing), but is confounded by volumetric smoothing: BOLD signal on one side of the sulcus is smoothed across to the physically close, yet cortically distant, bank of the opposite side. Smoothing in two dimensions (along the cortical sheet) overcomes this issue. Accordingly, we performed a surface-based analysis (with the Freesurfer image analysis suite: http://surfer.nmr.mgh.harvard.edu/) of the Speech Type × Attention interaction model simply for visualization purposes, so that we could more accurately locate the area exhibiting the interaction within the STS region. Contrast images were projected onto the inflated cortical surface, smoothed along the cortical sheet, and then submitted to a group-level analysis.
This visualization suggests that the rostral STS peak lies on the upper bank of the STS, and so may correspond to the auditory area TAa, whereas the more caudal peak lies more in the depth of the sulcus, but still on the upper bank, and may therefore correspond to multisensory TPO cortex (Fig. 8).
This study demonstrates that the comprehension of speech that varies in intelligibility and the engagement of brain areas that support speech processing depend on the degree to which listeners attend to speech. Our behavioral and fMRI results suggest that, in our paradigm at least, clear speech is generally processed and remembered regardless of whether listeners are instructed to attend to it, but speech that is perceptually degraded yet highly intelligible is processed quite differently when listeners are distracted.
The postscan recognition data show that unattended clear speech was encoded into memory, suggesting successful comprehension; listeners were able to remember clear sentences with similar accuracy whether attended or unattended. There was no difference in recognition accuracy for sentences presented in the distraction tasks, despite the difference in task difficulty, which suggests that it is not solely attentional load that determines whether unattended speech is processed. Nonetheless, we cannot discount the possibility that more challenging tasks could disrupt the processing of unattended clear speech, and future work will address this by manipulating load with a broader range of task difficulties. Our conclusion agrees with other studies that demonstrate effects of unattended clear speech on behavioral measures (Salamé and Baddeley, 1982; Hanley and Broadbent, 1987; Kouider and Dupoux, 2005; Rivenez et al., 2006) and electrophysiological responses (Shtyrov, 2010; Shtyrov et al., 2010), but conflicts with findings that listeners usually cannot remember unattended speech when listening to another talker (Cherry, 1953; Wood and Cowan, 1995). Competing speech signals are acoustically very similar, and attention is likely needed to segregate the target stream from interfering talkers, thereby reducing the resources available for processing unattended speech. This may be similar to our observation that attention is required to process degraded speech: significant Speech Type × Attention interactions in our behavioral and fMRI data indicate that processing of to-be-ignored (degraded) speech is significantly disrupted.
The combination of neural and behavioral interactions provides the first demonstration that the processing of degraded speech depends critically on attention. Degraded speech was highly intelligible when it was attended, but cortical processing and subsequent memory for those sentences was greatly diminished (to chance levels for NV-lo sentences) when attention was focused elsewhere. The recognition data strongly suggest that distraction impaired perception of degraded speech more than clear speech, consistent with our on-line BOLD measures of speech processing during distraction. In both the STS and LIFG, the processing of degraded (but not clear) speech was significantly enhanced by attention.
Previous fMRI studies of speech perception have either failed to observe similar interactions or have not assessed the degree to which unattended speech is processed at a behavioral level. For instance, Heinrich et al. (2011) showed that low-level auditory processes contributing to the continuity illusion for vowels remain operational during distraction, and thus low-level, speech-specific responses in posterior STG remain intact. Sabri et al. (2008) demonstrated that speech-evoked fMRI responses are attenuated and that lexical effects are absent, or reversed, during distraction. However, without behavioral evidence, it is hard to conclude (as proposed by Sabri et al., 2008) that processing is significantly diminished when speech is ignored. Furthermore, the noise associated with continuous fMRI scanning would have created a challenging listening situation that (according to our findings) might be equally responsible for the absence of neural responses to unattended speech.
Our fMRI results demonstrate that the distributed hierarchy of brain areas underlying sentence comprehension (Davis and Johnsrude, 2003; Davis et al., 2007; Hickok and Poeppel, 2007; Obleser et al., 2007) can be parcellated by the degree to which patterns of speech-related activity depend on attention. It is only brain regions more distant from auditory cortex, probably supporting higher-level processes, that show attentional modulation. Response patterns in primary auditory regions did not depend on attention (i.e., there was no interaction), despite a reliable main effect consistent with other studies of auditory attention (Alho et al., 2003; Hugdahl et al., 2003; Petkov et al., 2004; Fritz et al., 2007). This suggests that early auditory cortical processing of speech is largely automatic and independent of attention, but can be enhanced (or suppressed) by attention.
In contrast, areas of left frontal and bilateral temporal cortex exhibited robust changes in patterns of speech-evoked activity due to changes in attentional state. In both regions, this dependence manifested primarily as an increase in activity for degraded speech when it was attended compared with when it was ignored. However, the dissimilarity of patterns in these regions (i.e., the significant three-way interaction) provides evidence that attention influences speech processing differently in these areas. When speech was attended, LIFG activity for degraded speech was greater than for clear speech (i.e., a noise-elevated response), whereas in the STS, activity for degraded speech was enhanced to the level of clear speech. During distraction conditions, LIFG activity did not depend on speech type, but activity in the STS was correlated with intelligibility. Together, these results suggest that the LIFG only responds to degraded speech when listeners are attending to it, whereas the STS responds to speech intelligibility, regardless of attention or how that intelligibility is achieved. Increased activity for attended degraded speech may reflect the improvement in intelligibility afforded by explicit, effortful processing, or by additional cognitive processes (such as perceptual learning) that are engaged under directed attention (Davis et al., 2005; Eisner et al., 2010). A recent behavioral study demonstrated that perceptual learning of NV stimuli is significantly impaired by distraction under conditions similar to those studied here (Huyck and Johnsrude, 2012).
These fMRI results are consistent with the proposal that speech comprehension in challenging listening situations is facilitated by top-down influences on early auditory processing (Davis and Johnsrude, 2007; Sohoglu et al., 2012; Wild et al., 2012). Due to their anatomical connectivity, regions of prefrontal cortex—including LIFG and premotor cortex—are able to modulate activity within early auditory belt and parabelt cortex (Hackett et al., 1999; Romanski et al., 1999) and intermediate stages of processing on the dorsal bank of the STS either directly (Seltzer and Pandya, 1989, 1991; Petrides and Pandya, 2002a,b) or indirectly through parietal cortex (Petrides and Pandya, 1984, 2009; Rozzi et al., 2006). LIFG has been shown to contribute to the processes involved in accessing and combining word meanings (Thompson-Schill et al., 1997; Wagner et al., 2001; Rodd et al., 2005), and this information could be used to recover words and meaning from an impoverished speech signal. Somatomotor representations may also be helpful for parsing difficult-to-understand speech (Davis and Johnsrude, 2007), including NV stimuli (Wild et al., 2012; Hervais-Adelman et al., 2012). We note that many of the fMRI and transcranial magnetic stimulation studies that implicate left premotor regions in speech processing have similarly used degraded speech or other stimuli that are difficult for listeners to understand (Fadiga et al., 2002; Watkins et al., 2003; Watkins and Paus, 2004; Wilson et al., 2004; Wilson and Iacoboni, 2006; Osnes et al., 2011). We also observed significant interactions in bilateral insular cortex and in the left caudate nucleus. 
These areas connect with primary auditory cortex, prefrontal cortex, (supplementary) motor regions, and temporoparietal regions (Alexander et al., 1986; Middleton and Strick, 1994, 1996; Yeterian and Pandya, 1998; Clower et al., 2005) and have been shown to be involved in phonological processing (Abdullaev and Melnichuk, 1997; Bamiou et al., 2003; Tettamanti et al., 2005; Booth et al., 2007; Christensen et al., 2008). The interactions observed in these areas are consistent with the idea that motoric representations are engaged during effortful speech perception.
In light of our results, we propose that the interaction pattern observed in higher-order speech-sensitive cortex is a neural signature of effortful listening. Effort has recently become a topic of great interest to applied hearing researchers and is typically assessed through indirect measures; for example, autonomic arousal (Zekveld et al., 2010; Zekveld et al., 2011; Mackersie and Cones, 2011), the degree to which participants are able to perform a secondary task (Howard et al., 2010), or, as in our study, the ability of listeners to remember what they had heard (Rabbitt, 1968, 1990; Stewart and Wingfield, 2009; Tun et al., 2009). We propose that fMRI can be used to operationalize listening effort more directly by comparing the effortful BOLD response between attended and unattended speech conditions in the network of frontal areas we have observed. To validate this proposal, future work will attempt to relate individual differences in this BOLD response to listener attributes, such as the ability to divide attention, experience with degraded speech, and other cognitive factors. Neural measures of effortful listening might provide a novel means of assessing the efficacy and comfort of hearing prostheses, and help researchers to optimize the benefit obtained from these devices.
Our findings unequivocally demonstrate that the extent to which degraded speech is processed depends on the listener's attentional state. Whereas clear speech can be processed even when ignored, comprehension of degraded speech appears to require focused attention. Our fMRI results are consistent with the idea that enhanced processing of degraded speech is accomplished by engaging higher-order language-related processes that modulate earlier perceptual auditory processing.
This work was supported by the Natural Sciences and Engineering Research Council of Canada and the Canadian Institutes of Health Research.
The authors declare no financial conflicts of interest.
Correspondence should be addressed to Conor Wild, The Brain and Mind Institute, Western University, London, Ontario, Canada N6A 5B7.