Abstract
Humans use prior expectations to improve perception, especially of sensory signals that are degraded or ambiguous. However, if sensory input deviates from prior expectations, then correct perception depends on adjusting or rejecting prior expectations. Failure to adjust or reject the prior leads to perceptual illusions, especially if there is partial overlap (and thus partial mismatch) between expectations and input. With speech, “slips of the ear” occur when expectations lead to misperception. For instance, an entomologist might be more susceptible to hearing “The ants are my friends” for “The answer, my friend” (in the Bob Dylan song Blowin' in the Wind). Here, we contrast two mechanisms by which prior expectations may lead to misperception of degraded speech. First, clear representations of the sounds common to the prior and the input (i.e., expected sounds) may lead to incorrect confirmation of the prior. Second, insufficient representations of sounds that deviate between prior and input (i.e., prediction errors) could lead to deception. We used crossmodal predictions from written words that partially match degraded speech to compare neural responses when male and female human listeners were deceived into accepting the prior or correctly rejected it. Combined behavioral and multivariate representational similarity analysis of fMRI data shows that veridical perception of degraded speech is signaled by representations of prediction error in the left superior temporal sulcus. Rather than top-down processes simply supporting perception of expected sensory input, our findings suggest that the strength of neural prediction error representations distinguishes correct perception from misperception.
SIGNIFICANCE STATEMENT Misperceiving spoken words is an everyday experience, with outcomes that range from shared amusement to serious miscommunication. For hearing-impaired individuals, frequent misperception can lead to social withdrawal and isolation, with severe consequences for wellbeing. In this work, we specify the neural mechanisms by which prior expectations, which are so often helpful for perception, can lead to misperception of degraded sensory signals. Most descriptive theories of illusory perception explain misperception as arising from a clear sensory representation of the features or sounds that are common to prior expectations and sensory input. Our work instead provides support for a complementary proposal: that misperception occurs when there is an insufficient sensory representation of the deviation between expectations and sensory signals.
- fMRI
- misperception
- predictive coding
- prior expectations
- representational similarity analysis
- speech perception
Introduction
The underlying neural signals that distinguish veridical and illusory perception remain unspecified. Perceptual illusions occur if sensory input deviates from prior expectations and perceivers fail to adjust or reject priors (Fletcher and Frith, 2009). Misperception is especially pronounced if there is partial overlap (and thus partial mismatch) between prior expectations and sensory input.
There are two plausible neural mechanisms for generating perceptual illusions. First, misperception could arise due to clearer representations of the expected elements of sensory signals (McClelland and Elman, 1986; Norris et al., 2000). An alternative, prediction error theory (Mumford, 1992; Rao and Ballard, 1999; Friston, 2005), proposes a complementary mechanism: that misperception occurs when neural representations of sensory signals that deviate from prior expectations are absent. Both of these neural implementations of Bayesian perceptual inference can equally simulate a reduction of univariate activity for anticipated sensory signals (Blank and Davis, 2016; Aitchison and Lengyel, 2017): (1) clearer representations of expected stimuli would lead to reduced noise or competition from alternative interpretations or (2) “prediction error” representations would be reduced for expected input. Both of these theories are supported by the routine observation that neural activity is reduced for repeated stimuli (repetition suppression; Henson, 2003; Grill-Spector et al., 2006; Summerfield et al., 2008). Although reduced activity is plausibly due to a change in prior expectations (i.e., repeated stimuli are expected), it is not established whether repetition suppression is linked to reduced noise or reduced prediction errors in neural representations. In this work, we distinguish these two explanations using repetition-induced slips of the ear; that is, misperception of spoken words (Bond, 1999).
We therefore sought to measure speech representations in the left posterior STS (pSTS). This region shows effects of prior written word presentations on neural representations for degraded spoken words (Blank and Davis, 2016). Other studies have similarly shown influences of prior knowledge on pSTS activity during audiovisual speech processing (lip-reading: Nath and Beauchamp, 2011; Blank and von Kriegstein, 2013) and due to perceptual learning (Kilian-Hütten et al., 2011; Sohoglu and Davis, 2016; Bonte et al., 2017). Furthermore, multivariate pattern analysis shows that syllable identity can be decoded from fMRI responses in the pSTS (Formisano et al., 2008; Evans and Davis, 2015).
We used presentations of written text to manipulate prior knowledge (Sohoglu et al., 2014) and recorded perceptual and neural (fMRI) responses to degraded (vocoded) spoken words (Shannon et al., 1995; Fig. 1A). Written and spoken words were combined into the following: (1) match trials (i.e., written and spoken words were identical, e.g., whip–whip); (2) total mismatch trials (i.e., written/spoken words were phonologically unrelated, e.g., pit–corn); or (3) partial mismatch trials (i.e., written/spoken words had different initial or final sounds, e.g., kip–pip, pick–pip). Partial mismatch trials lead to frequent misperception because listeners often report that the written and spoken words match (Sohoglu et al., 2014). On each trial, participants provided a four-alternative button press to report whether the spoken word matched the previous written word (1 = “definitely same,” 2 = “possibly same,” 3 = “possibly different,” and 4 = “definitely different”).
Experimental design and hypotheses. A, Experimental design. We used fMRI to measure brain activity while participants read written words and heard subsequent degraded spoken words. Written and spoken words were combined in three conditions: (1) match (identical written/spoken words; e.g., whip–whip), (2) partial mismatch (e.g., kip–pip or pick–pip), and (3) total mismatch (e.g., pit–corn). Participants responded with a button press to indicate whether spoken/written words were “same” or “different” and their confidence. B, Stimulus conditions, responses, and underlying neural mechanisms for representing written/spoken word pairs. In an onset partial mismatch trial (depicted in the third row), the spoken word /pIp/ (“pip”) after written “KIP” can be perceived as “kip” (= “same” response) or “pip” (“different”). This behavioral outcome could be explained by one of two neural mechanisms. First, a representation of common sounds would produce a clear representation of the sounds “ip” (shown in black); this representation would be clearer (black) for trials in which participants report that written/spoken words are the “same” than if participants report that written/spoken words differ. Second, a representation of deviating sounds would produce a clearer representation of the deviating sounds “−k + p” (black) on trials in which participants report that written/spoken words differ and an unclear representation of the deviating sounds “−k + p” (light gray) if participants report that written/spoken words are the “same.” C, Similarity between partial mismatch pairs depends on the underlying neural mechanism. A neural representation of common sounds in written/spoken word pairs predicts that representations for word pairs sharing the same expected sounds (grouped by color, left side) should be more similar (e.g., KIP-/pIp/ share sounds “ip” and are therefore more similar to RIP-/wIp/, also sharing “ip,” than to KICK-/pIk/ or RICK-/wIk/ sharing “ick”). In contrast, a neural representation of deviating sounds (i.e., prediction error; grouping by shape, right side) predicts that word pairs sharing the same deviating sounds should be more similar (e.g., KIP-/pIp/ deviate in “−k + p” and should be more similar to KICK-/pIk/, also deviating in “−k + p,” than to RIP-/wIp/ and RICK-/wIk/ deviating in “−r + w”). Similar examples apply in other conditions (i.e., offset partial mismatch trials), ensuring that differential representation of onset and offset segments does not favor one or the other account.
Partial mismatch trials manipulated which speech sounds were in common with or deviated from prior expectations (Fig. 1B, Table 1) to distinguish two mechanisms for combining prior expectations and sensory signals. First, speech-sensitive brain regions could represent sounds that are common between input and prior expectation (Kok et al., 2012); clear representations of common sounds lead to confirmation of the prior (misperception) and unclear representations of common sounds to rejection (correct perception). Second, the brain could represent unexpected sounds that deviate between input and prior (i.e., prediction error; Rao and Ballard, 1999; Blank and Davis, 2016); clear representations of deviating sounds (prediction errors) lead to rejection of the prior and unclear representations of deviating sounds to confirmation (misperception; Fig. 1B). These two mechanisms make distinct predictions for which pairs of partial mismatch trials will evoke similar patterns of neural activity in speech-responsive regions (Fig. 1C), which we tested with multivariate fMRI.
Partial mismatch pairs
Materials and Methods
Design
To investigate the influence of prior expectations on the perception of degraded speech, behavioral responses and BOLD signals were acquired in an event-related fMRI design. Prior expectations were provided by presenting written words before degraded spoken words. The pairing of written and degraded spoken words was manipulated in three conditions: (1) match trials, in which written and spoken words were identical (e.g., kit–kit); (2) total mismatch trials, in which the spoken word was phonologically unrelated to the written word (e.g., kit–ball); and (3) partial mismatch trials, in which the spoken and written word were phonologically different at the end of the word (offset mismatch; e.g., kit–kick) or at the beginning of the word (onset mismatch; e.g., kit–pit). Each condition contained 32 different word pairs that were repeated throughout the experiment. Behavioral responses were collected in a four-alternative forced-choice task in which participants had to indicate whether they believed that the spoken word matched the previous written word (1 = “definitely same,” 2 = “possibly same,” 3 = “possibly different,” 4 = “definitely different”). In all following analyses, we merged responses 1 and 2 into “same” and responses 3 and 4 into “different” without considering confidence.
Ethics statement
Ethical approval was provided by Cambridge Psychology Research Ethics committee under approval number 2009.46. All participants provided their written informed consent.
Participants
Twenty-seven healthy native speakers of English (age 18–37 years) took part in the experiment after giving their informed consent. All participants were right-handed and reported normal or corrected-to-normal vision and no history of language, reading, or hearing impairments. Data from three participants had to be excluded: one due to technical problems during scanning, one due to an excessive number of missing behavioral responses (203 of 1280 responses missed, 15.86%), which was >4 SDs above the mean number of missed responses (M = 29.56, SD = 40.53), and one due to aberrant behavioral responses (too few “definitely different” responses in the total mismatch condition). The following analyses were therefore performed using data from the remaining 24 participants (mean age = 24.17 years, SD = 5.01; 9 males and 15 females).
Stimuli
Stimuli consisted of 32 monosyllabic words, which were presented in spoken and written format. Auditory words were spoken by a male speaker of southern British English and recorded at 16-bit resolution with a sampling rate of 44.1 kHz. The duration of spoken words ranged from 432 to 701 ms (M = 532, SD = 64). The 32 words consisted of two sets of 16 words; each set contained a different vowel, with items formed from four different onset and four different offset sounds (set 1: kit, kitsch, kip, kick, pit, pitch, pip, pick, writ, rich, rip, rick, wit, witch, whip, wick; set 2: corn, call, court, cork, torn, tall, taught, talk, born, ball, bought, baulk, warn, wall, wart, walk). Written and spoken words were combined in three conditions: (1) 32 match pairs (identical written and spoken words, e.g., whip–whip); (2) 32 partial mismatch onset and 32 partial mismatch offset pairs (e.g., pit–kit or pit–pitch); and (3) 32 total mismatch pairs (e.g., pit–corn). We selected item pairs in the partial mismatch trials carefully so that we could group these pairs into quadruples with the same common sounds and deviating sounds between written and spoken forms. These common sound and deviating sound groups allow us to address our central research question concerning the neural representations underlying speech perception and misperception (see Table 1 for a full list of item pairs and associated groups).
The amount of spectrotemporal detail of each spoken word was reduced by applying a noise-vocoding procedure (Shannon et al., 1995) using a custom-made MATLAB (The MathWorks) script. The script used six spectral channels that were logarithmically spaced between 70 and 5000 Hz and superimposed the slow temporal envelope (low-pass filtered at 30 Hz) onto corresponding band-pass-filtered white noises. These parameters were chosen on the basis of previous perceptual data suggesting that they would result in high accuracy for match and total mismatch trials and variable responses on partial mismatch trials (see experiment 3 in Sohoglu et al., 2014).
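The custom vocoding script is not reproduced here; the following MATLAB sketch illustrates the general procedure under the parameters stated above (six logarithmically spaced channels between 70 and 5000 Hz; envelopes low-pass filtered at 30 Hz). The function name, filter orders, and envelope extraction details are our illustrative assumptions rather than the authors' implementation.

```matlab
% Minimal noise-vocoding sketch (illustrative; not the authors' script).
% x: clear speech waveform, fs: sampling rate in Hz (e.g., 44100).
function y = vocode_sketch(x, fs)
    nChan = 6;
    edges = logspace(log10(70), log10(5000), nChan + 1);  % channel boundaries (Hz)
    [bEnv, aEnv] = butter(4, 30 / (fs/2), 'low');          % 30 Hz envelope low-pass
    y = zeros(size(x));
    for c = 1:nChan
        % band-pass filters in transfer-function form; narrow low-frequency
        % bands may need a zero-pole-gain/SOS design in practice
        [b, a] = butter(2, edges(c:c+1) / (fs/2), 'bandpass');
        band   = filtfilt(b, a, x);                        % band-limited speech
        env    = max(filtfilt(bEnv, aEnv, abs(band)), 0);  % slow amplitude envelope
        noise  = filtfilt(b, a, randn(size(x)));           % band-limited noise carrier
        y      = y + env .* noise;                         % modulate noise by envelope
    end
    y = y / max(abs(y)) * max(abs(x));                     % match peak level of input
end
```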
Stimuli were delivered and behavioral responses recorded using E-Prime 2.0 software (Psychology Software Tools). Visual stimuli were presented on a screen at the end of the scanner table, which participants could see through a mirror attached to the head coil above their eyes. Auditory stimuli were presented binaurally through in-ear headphones (Sensimetrics, model S14) after preprocessing to ensure a flat frequency response and presentation at a comfortable listening volume.
Before scanning, participants completed two practice sessions. The first session was to familiarize participants with noise-vocoded speech and the “same/different” task to be used in the scanner. The second practice session was identical in task and timing to the main experiment and participants were given feedback and repeated practice to ensure that they made their responses within a 2.5 s time limit.
fMRI procedure
The fMRI experiment lasted 75 min (5 MRI scanning sessions of 15 min). Each session included 300 randomized trials (256 event trials plus 44 null events). We used a fast sparse-imaging protocol in which the duration of each trial was 3 s and noise-vocoded spoken words were presented in the silent gap between scans (Fig. 1A). Within each trial, a fixation cross was presented for 500 ms, followed by the written word presentation for 500 ms, and finally a window of 500 ms with the spoken word. This 500 ms delay between written and spoken word onset has been shown to be sufficient time to generate a prior expectation for the subsequent word (Sohoglu et al., 2014). During and after each vocoded word, a blank screen was presented for 1.5 s. Participants were given 2.5 s from the onset of each spoken word to make a 4-alternative response indicating whether the spoken word matched the preceding written word. Participants gave responses by pressing one of four buttons on a response box using the fingers of their right hand. Throughout the experiment, the words “same” and “different” were presented at the bottom of the screen to remind participants of the corresponding response buttons.
Each word occurred as a prior written word or spoken word with equal probability and each word pair was repeated twice within each scanning session so that there were 10 presentations of each written–spoken word pair during the experiment. In addition to these experimental trials, each scanning session included 44 null events (trials without presentation of a written or spoken word) to aid estimation of a resting baseline.
Scanning parameters
Structural scanning.
MRI data were acquired on a 3 T Siemens Prisma scanner using a 32-channel head coil. A T1-weighted structural scan was acquired for each subject using a three-dimensional MPRAGE sequence (TR: 2250 ms, TE: 3.02 ms, flip angle: 9°, spatial resolution: 1 × 1 × 1 mm).
Functional scanning.
For each participant and scanning run, 312 echoplanar imaging (EPI) volumes comprising 32 slices of 3 mm thickness were acquired using a continuous, descending acquisition sequence (TR: 3000 ms, TA: 2000 ms, TE: 30 ms, FA: 84°, matrix size: 64 × 64, in-plane resolution: 3 × 3 mm, interslice gap: 25%). Of these images, the first three EPI volumes were discarded (to allow for T1 equilibrium effects) and an additional nine EPI volumes were acquired after the last event of each scanning run. We used transverse oblique acquisition with slices angled away from the eyes.
Acoustic similarity analysis
Acoustic dissimilarity between spoken words was computed using methods described previously (Billig et al., 2013). The matrix in Figure 2B illustrates the spectrotemporal similarity between stimuli. For each token, a gamma-tone-based Fourier transform was computed, approximating the frequency analysis performed by the ear. A spectral similarity matrix was then generated for each pair of tokens by comparing the spectral profile (on a log scale) of all time slices. Next, the maximum similarity path through this similarity matrix was found using dynamic time warping. Summed similarity values along this path were computed and rank transformed such that the two most similar sound files were assigned a score of 0 and the two most dissimilar sound files were given a score of 1. As in Billig et al. (2013), overall similarity reflects both shared vowels and consonants, although vowel similarity has a greater influence. The spectral analysis and dynamic time warping were implemented in MATLAB using existing functions for Gammatone spectral analysis and dynamic time warping supplied by Ellis, available at http://www.ee.columbia.edu/~dpwe/resources/matlab/gammatonegram/ and http://labrosa.ee.columbia.edu/matlab/dtw.
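As an approximate illustration of this pairwise comparison, the MATLAB sketch below substitutes the built-in spectrogram and dtw functions (Signal Processing Toolbox) for the gammatone and dynamic time warping code by Ellis used in the published analysis; the window lengths and rank-transform details are our assumptions, so values would differ from those reported.

```matlab
% Approximate sketch of the pairwise acoustic dissimilarity computation.
% wavs: cell array of waveforms (vocoded words) at sampling rate fs.
function D = acoustic_dissimilarity(wavs, fs)
    n = numel(wavs);
    specs = cell(n, 1);
    for i = 1:n
        % log-magnitude spectrogram (stand-in for a gammatone spectrogram)
        s = spectrogram(wavs{i}, round(0.02*fs), round(0.01*fs), 1024, fs);
        specs{i} = log(abs(s) + eps);
    end
    D = zeros(n);
    for i = 1:n
        for j = i+1:n
            % dtw aligns spectrogram time slices (columns) and returns the
            % summed Euclidean distance along the optimal warping path
            D(i, j) = dtw(specs{i}, specs{j});
            D(j, i) = D(i, j);
        end
    end
    % rank transform off-diagonal distances to [0, 1], as in the text
    % (0 = most similar pair of sound files, 1 = most dissimilar pair)
    mask    = ~eye(n);
    r       = tiedrank(D(mask));
    D(mask) = (r - 1) / (max(r) - 1);
end
```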
Behavioral analysis
First, we tested whether participants perceived word pairs in the match condition as being the “same” and pairs in the total mismatch condition as being “different” using repeated-measures ANOVAs and paired t tests in MATLAB (The MathWorks).
Second, we tested perception in the partial mismatch condition. To determine whether the rate of misperception for individual items was due to sounds that were shared between the prior and the input or to sounds that deviated between them, we compared p(“different”) for each partial mismatch pair with two groups of word pairs. These groups either had the same sounds in common (common sound groups) or had the same deviating sounds (deviating sound groups). The goal of this analysis was to determine whether perceptual outcomes (i.e., responding “same” or “different”) for a specific partial mismatch word pair (e.g., kit–pit) were better predicted by perception of: (1) three word pairs sharing the same deviating sounds (changing /k/ to /p/ as in kitsch–pitch, kip–pip, and kick–pick) or (2) three word pairs sharing the same common sounds (common sounds /It/, as in pit–kit, writ–wit, and wit–writ). To measure the similarity of behavior within each of these groups, we computed the sum squared difference between p(“different”) for each item pair and the mean of p(“different”) for the three word pairs from the common sound group or the three word pairs from the deviating sound group. The sum-squared-difference values were averaged over all partial mismatch items in each participant and over all participants for each item and entered into paired t tests and ANOVAs by participants and items.
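A minimal MATLAB sketch of this item-level comparison is shown below, assuming pDiff holds the proportion of “different” responses for the 64 partial mismatch pairs and commonGrp/deviantGrp hold the group labels (1–16) from Table 1; these variable names, and the reading of the sum squared difference as each item's squared deviation from the mean of the three other pairs in its group, are our assumptions.

```matlab
% Sketch of the sum-squared-difference analysis over the 64 partial
% mismatch pairs (variable names and data layout are illustrative).
ssdCommon  = zeros(64, 1);
ssdDeviant = zeros(64, 1);
for i = 1:64
    % the three other pairs sharing the same common sounds
    othersC = find(commonGrp == commonGrp(i) & (1:64)' ~= i);
    ssdCommon(i)  = (pDiff(i) - mean(pDiff(othersC)))^2;
    % the three other pairs sharing the same deviating sounds
    othersD = find(deviantGrp == deviantGrp(i) & (1:64)' ~= i);
    ssdDeviant(i) = (pDiff(i) - mean(pDiff(othersD)))^2;
end
% paired t test over items: lower values = more consistent behaviour
[~, p, ~, stats] = ttest(ssdDeviant, ssdCommon);
```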
Univariate fMRI analysis
Data were analyzed using SPM8 (http://www.fil.ion.ucl.ac.uk/spm) applying automatic analysis (aa) pipelines (Cusack et al., 2014). The first three volumes of each run were removed and the remaining scans were realigned to the first EPI image for each participant. The structural image was coregistered to the mean functional image and the parameters from the segmentation of the structural image were used to normalize the functional images, which were resampled to 2 mm isotropic voxels. The realigned, normalized images were then smoothed with a Gaussian kernel of 8 mm full-width at half-maximum (FWHM). Data were analyzed using a general linear model with a 128 s high-pass filter. The GLM included seven event types, each time-locked to spoken word onset and convolved with the canonical SPM hemodynamic response. These seven conditions were defined by the four types of preceding written text and participants' perception: (1) match perceived as “same,” (2) onset partial mismatch perceived as “same,” (3) offset partial mismatch perceived as “same,” (4) onset partial mismatch perceived as “different,” (5) offset partial mismatch perceived as “different,” (6) total mismatch perceived as “different,” and (7) errors (i.e., match perceived as “different” and total mismatch perceived as “same”).
After parameter estimation of the first level model, we conducted t tests of total mismatch perceived as “different” versus match perceived as “same” and partial mismatch perceived as “different” versus partial mismatch perceived as “same.” Brain regions are labeled based on the AAL atlas (Tzourio-Mazoyer et al., 2002).
Multivariate fMRI analysis
In the univariate analysis, we modeled BOLD responses combined over all item pairs within each individual condition (i.e., match, partial mismatch onset/offset, and total mismatch), but separated trials based on participants' behavioral responses (“same” vs “different”). This allowed us to measure the impact of perception on the magnitude of neural responses in partial mismatch trials. In the multivariate analysis, the first-level model was instead specified by separating specific item pairs within each of the experimental conditions regardless of behavioral responses (i.e., “same” and “different” responses were combined). This change was made for two reasons. First, we wanted to avoid empty cells for single item pairs. This was necessary because, for some participants, there were word pairs that were always perceived as “same” or as “different” in all 10 repetitions of a particular partial mismatch trial. Second, we wanted to ensure that the same number of trials for each item pair was included in the analysis. This avoids differences between neural representations for specific item pairs arising from combining different numbers of trials in the analysis.
Multivariate analyses were conducted on realigned data within each participant's native space without normalization or spatial smoothing. An additional first-level model was constructed for each participant. This model contained four conditions for which there were sufficient numbers of repetitions for item-specific modeling (match, total mismatch, onset partial mismatch, and offset partial mismatch). Importantly, regressors for the 32 individual spoken words were used in each of these four conditions. This resulted in 128 conditions per participant per run. For each of the 128 item-specific regressors, we estimated single-subject T-statistic images for the contrast of speech onset compared with the unmodeled resting period averaged over the five scanning runs.
We used the resulting single condition and item T-images (contrasted with the unmodeled resting baseline) for representational similarity analysis (RSA) (Kriegeskorte et al., 2008) using the RSA toolbox (Nili et al., 2014). We used T-images so that effect sizes were weighted by their error variance, which reduced the influence of large but variable response estimates on the multivariate analyses (Misaki et al., 2010). RSA involves testing whether the observed similarity of brain responses in specific conditions (a neural representational dissimilarity matrix, or RDM) corresponds to a hypothetical pattern of similarity between these conditions (a hypothesis RDM).
We constructed two hypothesis RDMs to test for greater similarity between word pairs. The first RDM tested word pairs that shared the same common sounds between prior and spoken word at either onset (e.g., kit–kitsch, kip–kick; here, the onset “ki” is the same for both word pairs) or offset (e.g., kit–pit, writ–wit; here, the offset “it” is the same for both word pairs). The second RDM tested word pairs that shared the same deviating sounds between prior and spoken word at either onset (e.g., kit–pit, kitsch–pitch; here, the deviating onsets “−k + p” are the same across word pairs) or offset (e.g., kit–kitsch, pit–pitch; here, the deviating offsets “−t + tʃ” are the same across word pairs). Onset and offset groups were combined into a single hypothesis RDM. We excluded between-vowel comparisons to ensure that the results were not influenced by vowel representations, which we have observed in previous studies (Evans and Davis, 2015; Blank and Davis, 2016). In addition, similarity between identical items (i.e., the main diagonal) was not included in our hypothesis RDMs (Fig. 5C,D).
In a first step, we used these RDMs to test for differences between common and deviating sound groups without taking behavior into account. In a second step, to determine whether representations of common or deviating sounds in the STS better explain perception and misperception of specific word pairs, we used behavioral measures as weights in the RSA. Specifically, we averaged the rate of “different” responses across the four word pairs contributing to each common or deviating sound group and rank ordered these groups in terms of the rate of accurate perception/misperception. With these ranks, we constructed hypothesis RDMs for individual participants to test for similarity between word pairs that shared common sounds or deviating sounds in partial mismatch pairs while incorporating variability in perceptual outcomes. Our reasoning was that neural representations of common sounds in partial mismatch trials should be more apparent the more often a word pair is perceived as the “same” (see Fig. 5A). Conversely, neural representations of deviating sounds should be stronger or more reliable for partial mismatch word pairs that are more often perceived as “different” (see Fig. 5B for an illustration of these predictions). Because the weights in the hypothesis RDMs express expected dissimilarity values (i.e., higher values for higher dissimilarity, which is the same as lower similarity), we reversed the ranking of the behavioral measures. For these analyses, we used perceptual outcomes from individual participants. Because we only aimed to test for a monotonic relationship between perception and neural similarity, we rank ordered behavioral responses and used a Spearman correlation to test the relationship between hypothesis and neural RDMs. We rank transformed the proportion of “different” responses for onset and offset partial mismatch groups separately for each of the two vowel sets (/I/ and /ɔ/ as in “kick” and “tall”). This ensured that these analyses link the rate of perception/misperception to informative neural representations of common/deviating sounds rather than to differences in the representation of the two vowels in our stimulus set (because these gave rise to different overall rates of speech perception/misperception).
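The MATLAB sketch below illustrates this computation for one participant and one grouping (e.g., deviating sound groups): a neural RDM (1 − Pearson correlation of item-wise response patterns) is Spearman-correlated with a hypothesis RDM whose within-group cells carry the reverse-ranked rate of “different” responses. The variable names, data layout, and restriction of the test to within-group cells are simplifying assumptions; the published analysis used the RSA toolbox and the full cell selection described above (excluding between-vowel comparisons and the diagonal).

```matlab
% Sketch of the behaviourally weighted RSA for one participant.
% betas:    [nVoxels x nItems] T-statistics for the partial mismatch items
% grp:      [nItems x 1] sound-group label for each item (1..16)
% pDiffGrp: [16 x 1] proportion of "different" responses per group
% (names and shapes are assumptions, not the authors' variables)

neuralRDM = 1 - corr(betas);              % item-by-item dissimilarity (1 - Pearson)

% Reverse-rank the behavioural rates so that groups perceived as
% "different" more often receive LOWER expected dissimilarity
r = tiedrank(pDiffGrp);
w = max(r) + 1 - r;

nItems = numel(grp);
hypRDM = nan(nItems);                     % NaN = cell excluded from the test
for i = 1:nItems
    for j = 1:nItems
        if i ~= j && grp(i) == grp(j)     % within-group comparisons only
            hypRDM(i, j) = w(grp(i));     % expected dissimilarity weight
        end
    end
end

% Spearman correlation over the cells defined in the hypothesis RDM,
% then Fisher's z transform for group-level statistics
cells = ~isnan(hypRDM);
rho   = corr(neuralRDM(cells), hypRDM(cells), 'Type', 'Spearman');
z     = atanh(rho);
```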
ROI definition.
Our key question concerned neural representation of partial mismatch trials in the left pSTS, a region previously shown to integrate prior expectations and spoken words (Blank and Davis, 2016). Importantly for our RSA approach, multivariate BOLD signals have been used to decode syllable identity in several previous studies (Formisano et al., 2008; Boets et al., 2013; Du et al., 2014; Evans and Davis, 2015; Blank and Davis, 2016). To locate this ROI for multivoxel RSA, we compared neural responses to total mismatch and match trials [t-contrast: total mismatch (“different” response) > match (“same” response) at p < 0.001]. In addition, to remove activations that extended into adjacent parietal regions, we applied a mask of the combined STG and MTG clusters from the Harvard–Oxford cortical structural atlas. The resulting ROI had a volume of 1148 mm³, corresponding to 34 voxels at the RSA voxel size of 3 × 3 × 3.75 mm. The same ROI was used for the analysis of deviating and common sound groups. This ROI definition is based on conditions (total mismatch vs match) that are entirely independent of those used in the RSA analysis (which is focused on partial mismatch trials). Furthermore, this ROI definition does not favor the representation of either deviating or common sounds in the main RSA analysis. Our previous work (Blank and Davis, 2016) has shown that univariate activation differences between unexpected and expected stimuli are equally consistent with two types of neural computation. A sharpening model (without representation of prediction errors) explains the decreased response in the match condition as due to a suppressed representation of unexpected features; that is, a reduced representation of deviating sounds and an enhanced representation of common sounds. Alternatively, a prediction error model explains the decreased response in the match condition as being due to reduced prediction errors, with representations of common sounds reduced and representations of deviating sounds enhanced in total mismatch conditions (Blank and Davis, 2016).
Searchlight analysis.
We conducted a whole-brain searchlight analysis to ensure that we did not overlook significant effects outside of the ROI that we defined a priori. We measured multivoxel neural RDMs by computing the dissimilarity (1 − Pearson correlation across voxels) of T-statistics for all possible combinations of items and conditions. In a searchlight analysis, the sets of voxels were extracted by specifying gray matter voxels (voxels with a value >0.20 in a probabilistic gray matter map) within a 10-mm-radius sphere of each gray matter voxel (with a voxel size of 3 × 3 × 3.75 mm; i.e., a maximum of 65 voxels per sphere). This was repeated for all searchlight locations in the brain. The similarity between the observed RDM and each of the hypothetical RDMs was computed using a Spearman correlation for each searchlight location and the resulting correlation coefficient returned to the voxel at the center of the searchlight. This resulted in a Spearman correlation map for each participant in each gray matter voxel. To assess searchlight similarity values across participants at the second level, the Spearman correlation maps for each participant were Fisher's z-transformed to conform to Gaussian assumptions, normalized to MNI space, and spatially smoothed with a 10 mm FWHM Gaussian kernel for group analysis. (For a visualization of our RSA procedure, see Figure 2 in Kriegeskorte et al., 2008.) We extracted similarity values from searchlights within our ROI defined using the independent contrast from the univariate fMRI analysis.
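A simplified sketch of the per-participant searchlight loop is given below, assuming voxel coordinates and item-wise T-statistics are already loaded into matrices and that hypRDM is a hypothesis RDM with NaNs marking excluded cells (as in the previous sketch); variable names are ours, and details such as gray matter thresholding and edge handling are omitted.

```matlab
% Simplified searchlight loop (one participant).
% gmIdx:    linear indices of gray matter voxels
% xyz:      [nVox x 3] voxel coordinates in mm (same order as gmIdx)
% allBetas: [nVox x nItems] T-statistics per voxel and item
radius = 10;                                   % searchlight radius in mm
rhoMap = nan(size(gmIdx));
for v = 1:numel(gmIdx)
    % voxels within a 10 mm sphere of the current gray matter voxel
    d2     = sum(bsxfun(@minus, xyz, xyz(v, :)).^2, 2);
    sphere = d2 <= radius^2;
    % neural RDM: 1 - Pearson correlation across the sphere's voxels
    neuralRDM = 1 - corr(allBetas(sphere, :));
    cells     = ~isnan(hypRDM);
    rhoMap(v) = corr(neuralRDM(cells), hypRDM(cells), 'Type', 'Spearman');
end
% Fisher's z transform; normalization to MNI space and smoothing for the
% group analysis would follow (not shown)
zMap = atanh(rhoMap);
```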
ROI analysis.
In addition to using the ROI defined by the univariate contrast total mismatch (“different” response) > match (“same” response) as a search volume in the whole-brain RSA (previous section), we used this ROI to extract neural RDMs from the partial mismatch conditions to test for representations of deviating and common sounds. Specifically, we correlated the neural RDM from this ROI with the behaviorally weighted hypothesis RDMs for the deviating and common sound groups. We conducted one-sample t tests on the obtained Fisher's z-transformed Spearman correlation values for these two RDMs to determine whether the correlation was significantly >0 for each condition individually. We then tested for differences between these conditions in a paired t test. This approach allows us to test representations specifically within our a priori defined ROI. There are some methodological differences between the whole-brain searchlight and the ROI approach: (1) the same number of voxels per sphere across all searchlight locations across the brain versus one fixed cluster size in the ROI approach, (2) gray matter masking in the searchlight approach versus none in the ROI approach, and (3) comparison of searchlight locations across subjects in MNI space versus transformation of the individual ROI into each subject's native space in the ROI approach.
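For the group-level ROI statistics, a minimal MATLAB sketch is shown below, assuming zDev and zCom are vectors of Fisher's z-transformed Spearman correlations (one per participant) between the ROI-wide neural RDM and the deviating- and common-sound hypothesis RDMs, computed as in the earlier sketch; the variable names are ours.

```matlab
% Group-level ROI tests on Fisher's z-transformed correlations.
% zDev, zCom: [nSubjects x 1] vectors, one value per participant.
[~, pDev,  ~, statsDev]  = ttest(zDev);        % deviating sounds: z > 0?
[~, pCom,  ~, statsCom]  = ttest(zCom);        % common sounds:    z > 0?
[~, pDiff, ~, statsDiff] = ttest(zDev, zCom);  % paired comparison of the two
fprintf('deviating: t(%d) = %.2f, p = %.4f\n', statsDev.df, statsDev.tstat, pDev);
```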
Results
Partial mismatch with prior expectations leads to frequent misperception
Behavioral responses confirmed that participants correctly perceived written and spoken word pairs in the match condition as identical and pairs in the total mismatch condition as “different.” Perception in the partial mismatch condition was more variable, such that listeners were often deceived into reporting that the spoken word matched the written prior (Fig. 2C–E). A repeated-measures one-way ANOVA revealed significant differences between these three conditions (F(23,2) = 603.303, p < 0.001). Post hoc paired t tests confirmed more “same” responses in match than partial mismatch conditions (t(23) = 17.719, p < 0.001) and in partial mismatch than in total mismatch conditions (t(23) = 11.782, p < 0.001).
Stimulus similarity, behavioral confusion matrix, and behavioral results. A, Stimulus similarity matrix. We combined 32 written words with 32 spoken words in three different conditions (match, total mismatch, and partial mismatch at onset or offset) so that the experiment contained 128 different spoken/written word pairs. Pairs of written and spoken words had varying numbers of overlapping sounds in three different experimental conditions. In match trials, all three speech segments overlapped (blue diagonal); in total mismatch trials, no segments overlapped (red). In partial mismatch trials, two segments overlapped between written and spoken words (yellow). B, Acoustic similarity. Acoustic dissimilarity between six-channel vocoded spoken words was computed using methods described previously (Billig et al., 2013) and is shown for critical word pairs in rank order. C, Mean behavioral responses. Participants responded to each word pair to indicate whether written and spoken words matched and their confidence (1 = “definitely same,” 2 = “possibly same,” 3 = “possibly different,” and 4 = “definitely different”). Match trials were perceived as “definitely same.” Total mismatch trials were perceived as “definitely different.” Partial mismatch trials were perceived as “same” or “different” with reduced confidence. D, SD of responses per word pair. Behavioral responses in the match and total mismatch conditions were consistent (blue), whereas responses in the partial mismatch condition were more variable. E, Behavioral responses showed more “same” responses (light gray bars) in the match than in the partial mismatch and total mismatch conditions. Conversely, participants responded correctly (“different”) in a large proportion of partial mismatch and in almost all total mismatch trials (dark gray bars). Error bars show SEM over subjects corrected for repeated-measures comparisons. F, Proportion of “different” responses shown separately for partial mismatch trial conditions split by vowel and by onset/offset mismatch. Error bars show the SEM over items.
Within the partial mismatch trials, the rate of “different” responses was related to the acoustic similarity/dissimilarity between expected and heard speech. Acoustic dissimilarity between six-channel vocoded spoken words, and between six-channel vocoded and clear spoken words, was computed using correlation methods described previously (Billig et al., 2013). Degraded spoken words that were more similar to the six-channel vocoded acoustic form of the preceding written word were more often judged to be identical, and more dissimilar word pairs were more often judged to be “different” (r(62) = 0.3906, p = 0.0014; Fig. 2B,C). This correlation with behavior in the partial mismatch condition was not apparent for similarity between six-channel vocoded and clear spoken words (r(62) = −0.0039, p = 0.9754); it emerged only when all conditions, including match and total mismatch, were considered (r(126) = 0.5195, p < 0.001 for clear-to-degraded similarity and r(126) = 0.8859, p < 0.001 for degraded-to-degraded similarity). However, these correlations do not reveal whether it is the acoustic similarity of common sounds or the acoustic dissimilarity of deviating sounds that is more important in determining perception and misperception in partial mismatch trials. To explore this issue, we used between-item and between-participant variation in perception of partial mismatch trials (Fig. 2D).
To determine whether perception depends more on common or deviating sounds between prior written text and degraded speech input, we compared rates of perception and misperception for each partial mismatch word pair with two other groups of word pairs (Fig. 3A). This analysis assessed whether perception of a specific partial mismatch word pair (e.g., kit–pit) is better predicted by perception of three other word pairs that share the same common sounds (i.e., pit–kit, wit–writ, and writ–wit, which all contain the common offset “it”) or the same deviating sounds (i.e., kip–pip, kitsch–pitch, and kick–pick, which all contain the deviating onset “−k + p”). All 64 partial mismatch item pairs (32 onset and 32 offset mismatch pairs) were grouped into 16 common sound groups and 16 deviating sound groups (for a full list, see Table 1). Within each group, we computed the sum squared difference of response rates (i.e., proportion of “different” responses) to assess whether behavioral responses were more consistent for partial mismatch word pairs grouped by their common or by their deviating sounds.
Perception of partial mismatch pairs is predicted by the identity of deviating sounds. A, Schematic illustration of behavioral analysis of partial mismatch trials. For each partial mismatch pair, we computed the sum square difference between the rate of “different” responses for that item pair and the three other partial mismatch word pairs that share either the same common sounds (e.g., kit–pit compared with pit–kit, wit–writ, etc.) or deviating sounds (e.g., kit–pit compared with kip–pip, kitsch–pitch, etc.). This analysis is independent of the overall rate of “different” responses, but considers the consistency of responses between items within the same group. B, Perceptual outcomes were significantly more similar for word pairs sharing the same deviating sounds than for word pairs sharing the same common sounds (i.e., reduced sum squared difference). Error bars show the SEM over items. C, Mean squared differences for common and deviating sound groups split by vowel and for onset/offset partial mismatch.
Responses to partial mismatch pairs were significantly more similar (i.e., lower sum squared difference) for word pairs sharing the same deviating sounds than for items sharing the same common sounds (paired t tests over items: t(63) = 6.744, p < 0.001; and over participants: t(23) = 10.567, p < 0.001; averaged data shown in Fig. 3B). Behavioral performance is more homogeneous when partial mismatch item pairs are grouped according to the deviating sound than when they are grouped according to the common sound. These results therefore indicate that speech perception and misperception are better predicted by the specific speech sounds that deviate from prior expectations than by the sounds that are consistent with prior expectations.
For completeness, we ran additional exploratory analyses on behavioral data separating partial mismatch trials with different vowels and onset/offset mismatch. For p(“different”) (Fig. 2F), ANOVAs by participants (F1) and items (F2) showed significant main effects of vowel identity (F1(1,23) = 117.413, p < 0.001; F2(1,60) = 20.64, p < 0.001) and onset/offset (F1(1,23) = 13.925, p = 0.001; although this was only a trend by items: F2(1,60) = 3.37, p = 0.0712), as well as an interaction (F1(1,23) = 51.219, p < 0.001; F2(1,60) = 5.42, p = 0.0233).
In addition, we conducted ANOVAs on sum-squared-difference values derived from the behavioral data (Fig. 3C). For word pairs grouped by deviating sounds, there was no main effect of vowel (F1(1,23) = 0.152, p = 0.6999; F2(1,60) = 0.01, p = 0.9133) and no consistent effect of onset and offset (F1(1,23) = 11.106, p = 0.0029; F2(1,60) = 1.78, p = 0.1876) or interaction of vowel and onset/offset (F1(1,23) = 56.164, p < 0.001; F2(1,60) = 1.76, p = 0.1892). For the common sound groups, there were no significant effects (main effect of vowel: F1(1,23) = 0.375, p = 0.5462; F2(1,60) = 0.08, p = 0.7839; main effect of onset and offset: F1(1,23) = 2.361, p = 0.1380; F2(1,60) = 0.6, p = 0.4414; and interaction of vowel and onset/offset: F1(1,23) = 0.683, p = 0.4169; F2(1,60) = 0.13, p = 0.7203). Given the lack of significant effects in item analyses and our between-item manipulation of vowel and onset/offset mismatch, findings from the analysis across participants are potentially false-positives. We did not have specific hypotheses regarding the influence of these other factors, so further studies are needed to follow up on how vowel identity and position of mismatch influence perception and neural representations.
Univariate magnitude of BOLD activity increases during perception of mismatch
Next, we analyzed fMRI responses to assess how the magnitude of neural responses differed between trials in which matching and mismatching text preceded spoken words. We replicated previous results (Sohoglu et al., 2012; Blank and Davis, 2016) showing significantly greater activity for total mismatch than match trials in the bilateral STS [p < 0.05 familywise error (FWE)-corrected; Fig. 4, Table 2]. We further showed that the magnitude of the BOLD signal was increased for partial mismatch pairs heard as “different” compared with the same word pairs heard as “same” in a largely overlapping brain network including the left STS (Fig. 4, Table 3). Brain regions in and around the left pSTS have long been known to support perceptual processing of speech (Scott and Johnsrude, 2003; Hickok and Poeppel, 2007) and to integrate expectations from different modalities with speech input (Noppeney et al., 2008; Sohoglu et al., 2012; Blank and von Kriegstein, 2013; Blank and Davis, 2016).
Univariate fMRI results. A, Whole-brain fMRI analysis showing overlapping response increases in the left STS for two key contrasts: total mismatch (“different” response) > match (“same” response; blue) and partial mismatch (“different” response) > partial mismatch (“same” response; green). Overlapping responses are shown in cyan (both contrasts are displayed at p < 0.001, uncorrected but reach p < 0.05 FWE cluster-corrected significance in left STS; Tables 2 and 3). B, BOLD parameter estimates versus rest in the left pSTS extracted from the overlapping region activated for the two contrasts: total mismatch (“different” response) > match (“same” response) and partial mismatch (“different” response) > partial mismatch (“same” response). Error bars show the SEM over participants after between-participant variance is removed and are thus suitable for repeated-measures comparisons.
Univariate fMRI analysis: total mismatch “different” percept > match “same” percept displayed at p < 0.001 uncorrected and >10 voxels per cluster
Univariate fMRI analysis: partial mismatch “different” percept > partial mismatch “same” percept displayed at p < 0.001 uncorrected and >10 voxels per cluster
In addition, we examined the magnitude of univariate activity in the overlapping left pSTS region identified using total mismatch (“different” response) > match (“same” response) and partial mismatch (“different” response) > partial mismatch (“same” response) (Fig. 4B). We found no significant difference between total mismatch (“different” response) and partial mismatch (“different” response) trials (t(23) = 1.6172, p = 0.1195), nor between match (“same” response) and partial mismatch (“same” response) trials (t(23) = 0.9782, p = 0.3381). We also observed a difference in univariate activation in the left postcentral gyrus. This is plausibly due to differential difficulty of the button-press responses that participants made with the right hand and need not reflect a speech-specific process. However, because the STS has not been shown to process finger movements, it seems implausible that a similar explanation could apply to differential activity for match and mismatch trials in the STS.
Neural representations of deviating, not common, sounds are linked to (mis)perception
We used multivariate representational similarity analysis (Kriegeskorte et al., 2008; Nili et al., 2014) to distinguish between representations of deviating and common sounds in partial mismatch trials. We defined an independent STS ROI based on the contrast of total mismatch > match trials at p < 0.001 uncorrected, inclusively masked with superior and middle temporal gyrus regions from the Harvard–Oxford atlas (Desikan et al., 2006). In this search volume, we first tested for similarity between partial mismatch word pairs that shared common sounds between prior and spoken word at syllable onset (e.g., kit–kitsch, kip–kick) or offset (e.g., kit–pit, writ–wit) and between word pairs that shared deviating sounds between prior and spoken word at syllable onset (e.g., kit–pit, kip–pip) or offset (e.g., kip–kick, pip–pick) (Table 1). These analyses showed a significant representation of deviating sounds for searchlight locations in our STS ROI [x = −63, y = −40, z = 9, pFWE small volume corrected (pFWEsvc) = 0.017, t(23) = 3.18] and a marginally significant trend for representations of common sounds in the same region (x = −66, y = −34, z = 12, pFWEsvc = 0.059, t(23) = 2.58). However, a paired t test revealed no significant difference between these representations (pFWEsvc = 0.656, t(23) = 0.41). This analysis provides some limited evidence for neural representations of deviating sounds in partial mismatch trials and is equivocal concerning representations of common sounds. Therefore, these results provide no clear evidence to favor one or the other type of neural representation.
To determine whether representations of deviating or common sounds in the STS better explain perceptual outcomes, we conducted a further multivariate analysis that included participant-specific measures of the rate of perception and misperception for common and deviating sound groups (Table 1). To do this, we averaged the rate of “different” responses across the four word pairs contributing to each common or deviating sound group and rank ordered these groups in terms of the rate of accurate perception or misperception. If representations of common sounds in partial mismatch trials determine perception, then stronger representations of these common sounds should correlate with more frequent “same” responses (i.e., misperception; Fig. 5A). Conversely, if representations of deviating sounds determine perception, then stronger representations of these sounds should be apparent for partial mismatch pairs that are more often perceived as “different” (i.e., correct perception; Fig. 5B). For this analysis, we used behavioral measures from individual participants rank transformed separately for each of the two vowel sets (/I/ and /ɔ/ as in “kick” and “tall”) and for onset/offset mismatch pairs (see Table 1). This ensured that these analyses exclude otherwise uninteresting differences between the rate of perception/misperception for the two vowels and onset/offset mismatches. Using rank correlations, we tested for any monotonic relationship between perceptual outcomes and neural representations without requiring a linear relationship.
RSA predictions, methods, and pSTS results. A, Predicted correlation if misperception is associated with representations of common sounds in partial mismatch trials. If neural representations preferentially represent common sounds in written/spoken word pairs, then partial mismatch pairs that share the same common sounds (e.g., kit–kick, kit–kitsch, etc.) should generate similar neural representations (groups of items with the same common sounds are indicated by the same color). Clearer representations of common sounds (i.e., increased neural similarity within groups) should lead to confirmation of the prior (i.e., “same” responses, misperception), whereas less clear representations lead to rejection of the prior (“different” responses, correct perception). Therefore, representation of common sounds in partial mismatch trials predicts a negative correlation between neural similarity and perception. B, Predicted correlation if perception is associated with representations of deviating sounds in partial mismatch trials. If neural representations preferentially represent deviating sounds in written/spoken word pairs, then partial mismatch pairs that share the same deviating sounds (e.g., kip–kick, whip–wick, etc.) should generate similar neural representations (groups of items with the same deviating sounds are indicated by the same shape). Clearer representations of these deviating sounds (i.e., increased neural similarity within groups) should lead to rejection of the prior (i.e., “different” responses, correct perception), whereas less clear representations of deviating sounds lead to confirmation of the prior (“same” responses, misperception). Therefore, representation of deviating sounds in partial mismatch trials predicts a positive correlation between neural similarity and perception. C, D, Hypothesis RDMs for comparisons of word pairs that shared the same common sounds in written/spoken word pairs (e.g., for offset mismatch pairs containing the vowel /I/: kit–kitsch, kick–kip; here, the common sounds /kI/ are the same for these word pairs; C) or shared the same deviating sounds in written/spoken word pairs (e.g., for offset mismatch pairs containing the vowel /I/: kit–kitsch, pit–pitch; despite the different spellings, these contain the same deviating /t/ and /tʃ/ sounds; D). In other hypothesis RDMs (data not shown), we applied the same principle for onset mismatch pairs such as pick–kick and for the vowel /ɔ/ as in tall–call. We can supply the other RDMs to interested readers on request. For visualization, we show a hypothesis RDM based on the average ranking of “different” responses across participants; for analysis, different rankings were used based on behavioral data from individual participants. E, Search volume in the left STS used in these analyses was defined from an independent univariate contrast total mismatch (“different” response) > match (“same” response) (see Materials and Methods and Fig. 3A) masked to confine analysis to the superior temporal sulcus. F, G, Correlation of neural similarity and perceptual outcomes. Results are visualized for 16 data points for four different sets of word pair groups and the factorial crossing of offset/onset partial mismatch word pairs containing the two vowels /I/ and /ɔ/ as in “kick” and “tall.” Lines show the least-square fit to the data. F, Common sound groups. When partial mismatch trials were grouped by common sounds, neural similarity did not correlate with perception as hypothesized (cf. A). G, Deviating sound groups.
When partial mismatch trials are grouped by deviating sounds, neural similarity correlated positively with perception as hypothesized (B). Results and least-square fit lines are as in F.
In the searchlight analysis, we correlated neural RDMs from each searchlight sphere with two hypothesis RDMs containing behavioral responses as similarity weights for word pairs grouped either by the deviating sounds or by the common sounds in the item pairs (schematically depicted in Fig. 5A,B). When we applied small volume correction for our STS search volume (Fig. 5E), there was a positive correlation between single-subject measures of perception (i.e., “different” responses) and neural representations of deviating sounds (pFWEsvc = 0.01, t(23) = 3.46, x = −66, y = −25, z = 5) and no correlation of misperception (i.e., “same” responses) with representations of common sounds (pFWEsvc = 0.693, t(23) = 0.28). A paired t test further showed that the correlation with representations of deviating sounds was more reliable than the correlation with representations of common sounds (deviating vs common sound groups paired t test: pFWEsvc = 0.042, t(23) = 2.74, x = −66, y = −25, z = 5). To visualize the outcome of this analysis (Fig. 5F,G), we computed the average neural similarity among the four item pairs within each group for each participant and averaged the rank-ordered item pairs over participants based on the proportion of “different” responses (i.e., as shown schematically for one mismatch position and vowel in Fig. 5A,B).
We supplemented this searchlight analysis by extracting a Fisher's z-transformed Spearman correlation value for each of the two analyses with a pattern similarity computed for the whole of the ROI. BOLD pattern similarity computed over all voxels in the ROI correlated with individual participants' rates of “different” responses for word pairs grouped according to deviating sounds (r = 0.0858, one-sample t test: t(23) = 2.5715, p = 0.0171). Furthermore, the equivalent correlation was nonsignificant for common representations; higher rates of responding “same” were not correlated with representational similarity for words pairs grouped according to common sounds (r = −0.0253, one-sample t test: t(23) = −0.7946, p = 0.4350). Again, a comparison of these two correlations with a paired t test showed significantly more reliable correlations between perceptual outcomes with prediction error representations than with expected representations (t(23) = 2.6472, p = 0.0144).
To ensure that effects in other brain areas were not missed, we also inspected whole-brain searchlight results for these three multivariate analyses (Fig. 6). This did not reveal any further areas that reached a whole-brain-corrected threshold, but showed two clusters in left motor and frontal regions at p < 0.001 uncorrected for the paired t test comparing deviating versus common sound groups (Table 4). The left motor cluster was also observed for the correlation between behavioral responses and representations of deviating sounds in partial mismatch trials (Table 5). No searchlight locations reached p < 0.001 uncorrected for correlation with representation of common sounds.
Results of the whole-brain searchlight RSA approach (shown at p < 0.001, uncorrected for clarity). The paired t test comparing the correlation between single-subject perception and neural representations of common versus deviating sounds is shown in red. The correlation between single-subject perception and representations of deviating sounds is shown in yellow. Corresponding coordinates are reported in Tables 4 and 5.
RSA fMRI analysis: paired t test comparing the correlation between single-subject perception and representations of common versus deviating sounds reported at p < 0.001 uncorrected
RSA fMRI analysis: correlation between single-subject perception and representations of deviating sounds reported at p < 0.001 uncorrected
In a further exploratory analysis, and to generate hypotheses for future studies, we also examined neural representations in another cluster that showed activation differences in the univariate analysis. Specifically, we used the cluster in the left middle frontal gyrus revealed by the independent univariate contrast total mismatch (“different” response) > match (“same” response) (peak at x = −32, y = 20, z = 32) for small volume correction, because this region has previously been reported to contain representations of prior information during perception of degraded speech (Blank and Davis, 2016; Sohoglu and Davis, 2016; Cope et al., 2017). Here, we found a significant representation of deviating sound groups (x = −33, y = 11, z = 31, pFWEsvc = 0.004) and no representation of common sound groups (x = −27, y = 26, z = 35, pFWEsvc = 0.631). This difference was also significant in a paired t test (x = −39, y = 17, z = 28, pFWEsvc = 0.007).
Discussion
Misperceiving spoken words is a common, everyday experience with outcomes that range from shared amusement to serious miscommunication. For hearing-impaired individuals, frequent misperception can lead to social withdrawal and isolation, with severe consequences for wellbeing (Dalton et al., 2003). In this work, we specify the neural mechanisms by which prior expectations, which are so often helpful for perception, can lead to misperception of degraded sensory signals.
We induced frequent misperception of speech by providing clear prior expectations (written text) that partially matched/mismatched with degraded spoken words. Listeners often reported that a spoken word with one mismatching sound was the same as previously presented text (e.g., reporting that pairs such as pick–kick or pick–pip are the “same”). Behavioral results revealed that perceptual outcomes for these pairs were more similar to perceptual outcomes for other word pairs that shared the same deviating sounds (i.e., “−p + k” or “−k + p” in the examples above) than for word pairs that shared the common sounds (i.e., “ick” or “pi”). However, this behavioral observation does not determine the underlying neural mechanisms that support perception and misperception of speech.
Our fMRI data showed reductions in the magnitude of the univariate BOLD signal in the left pSTS (Fig. 4) for written/spoken word pairs heard as “same”. This effect does not seem to reflect passive adaptation because the magnitude of the reduction does not depend on the number of shared/deviating segments (i.e., partial mismatch and total mismatch trials respond similarly), but rather on the perceptual outcome (i.e., whether participants respond “same” or “different”). Therefore, the influence of prior knowledge on lower-level speech processing is linked to trial-by-trial perceptual outcomes (i.e., detecting deviating sounds in partial mismatch pairs). However, these response reductions do not determine the neural mechanisms responsible (Blank and Davis, 2016; Aitchison and Lengyel, 2017). Reduced univariate activity for matching trials could be due either to more efficient/less effortful processing of common sounds (Murray et al., 2004; Kok et al., 2012; Blank and Davis, 2016) or to suppressed processing of common sounds (i.e., explaining away; Murray et al., 2004; Friston, 2005; Blank and Davis, 2016). Both of these proposals can explain reductions in the magnitude of neural responses for partially matching trials that are heard as “same” and other similar findings from repetition suppression designs. Therefore, we used RSA fMRI to measure representational content in the pSTS to specify the neural mechanisms by which listeners combine prior knowledge and degraded sensory signals. Specifically, we can decode whether the repeated (i.e., expected) or the nonrepeated (unexpected) part of the stimulus is preferentially represented in the pSTS and thus how representations of (un)expected elements of degraded stimuli are linked to perception.
The findings of our multivariate fMRI analyses confirm representations of prediction error in the STS. Neural representations of deviating sounds were correlated with perceptual outcomes; that is, neural representations of prediction error were more apparent for trials in which written/spoken mismatch was detected. The equivalent correlation with perceptual outcomes for representations of expected sounds was nonsignificant (and showed a numerical trend in the nonpredicted direction). Furthermore, there was a significant difference between the positive correlation for deviating sounds and the null correlation for common sounds. Although our methods do not permit us to draw conclusions from the absence of a significant effect, we note that effect sizes for our reliable multivariate analyses are consistent with those seen in previous, similar fMRI studies (Evans and Davis, 2015; Blank and Davis, 2016).
We therefore conclude that neural representations of prediction error are apparent in the pSTS and linked to perceptual outcomes during perception of degraded speech. These findings are best explained by the proposal that neural representations in the pSTS signal prediction error, that is, the speech sounds that deviate from prior expectations, and they are consistent with accounts of speech perception that assign an important role to predictive coding computations (Arnal et al., 2011; Giraud and Poeppel, 2012; Blank and Davis, 2016).
Our previous work also provided evidence for a predictive coding account of speech perception by showing (in the pSTS) an interaction such that increased sensory detail had opposite effects on multivariate speech representations after neutral and matching text (Blank and Davis, 2016). Although differences in the neural representation of speech in this previous work could be due to changes in listening strategy (e.g., listeners anticipating that degraded speech will be harder to understand following neutral text or being distracted by prior presentation of written text), these alternative explanations could not apply to the present study in which all spoken words were preceded by written text. The present study also goes beyond our previous work by directly linking perceptual outcomes to neural representation of prediction error. That is, trials that evoke clearer neural representations of deviating sounds (i.e., prediction errors) in the pSTS lead to more accurate perception.
Alternative theories of speech perception, most notably interactive activation accounts such as the TRACE model (McClelland and Elman, 1986), have proposed that perception depends on joint activation of common representations between prior expectations and speech signals. Our experimental design allowed several tests for the representation of these common sounds during partial mismatch trials, but our fMRI data provide no evidence for neural representations of expected sounds as proposed by interactive activation models. Therefore, instead of using top-down processes to support the perception of expected sounds (a mechanism that has been criticized previously as being too vulnerable to hallucination; Norris et al., 2000), we propose that neural representations of prediction error play a critical role in achieving accurate perception of speech. Listeners use representations of prediction error as a signal to update or overrule prior expectations when these are incompatible with incoming signals. Stronger prediction error signals therefore lead to correct rejection of prior expectations and more accurate perception of degraded speech. Although our findings challenge interactive activation accounts of perception (McClelland and Elman, 1986), we cannot rule out some predictive coding theories (Rao and Ballard, 1999; Friston, 2005) in which representations of prediction error and expected sounds (i.e., top-down predictions) are computed in parallel in different sets of neurons or cortical laminae. It remains to be seen whether other methods (e.g., laminar-specific analysis of ultra-high-field fMRI) can be used to demonstrate a representation of expected sounds that are detected in degraded signals or if expected sounds are not represented directly in neural responses.
One avenue for future investigation could be to explore other influences of prior knowledge in perception. For example, recent multivariate fMRI studies have shown changes to neural representations of ambiguous speech sounds due to adaptation or phonetic recalibration training (Kilian-Hütten et al., 2011; Bonte et al., 2017). These decoding techniques demonstrate that neural representations in the pSTS can discriminate between different perceptual outcomes for ambiguous sounds due to learning. However, so far, these findings do not reveal the mechanisms underlying these neural representations; that is, they do not distinguish between sharpening and prediction error mechanisms (Kilian-Hütten et al., 2011; Bonte et al., 2017), although other studies have shown common neural changes due to prior knowledge and perceptual learning consistent with predictive coding (Sohoglu and Davis, 2016). Future work could test these claims using multivariate fMRI methods. Critically for computations of prediction error, correspondences between sensory signals and prior expectations can either enhance or suppress informative neural representations (depending on signal quality and perceptual outcomes; Blank and Davis, 2016), whereas sharpening accounts propose that neural representations of signals are always enhanced by accurate expectations. Further tests of these proposals in the context of perceptual learning would be informative.
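To make the contrast between these two accounts concrete, the toy sketch below shows one way in which a sharpening scheme and a prediction error scheme diverge when prior expectations match the sensory input. The feature vectors and the multiplicative gain rule are simplifying assumptions for illustration and do not correspond to any model fitted in this study.

```python
# Toy illustration (not a model from the paper) of how sharpening and
# prediction error schemes differ when prior expectations match the input.
# The feature vectors and the multiplicative gain rule are simplifying assumptions.
import numpy as np

input_features = np.array([0.2, 0.8, 0.1])       # degraded speech evidence for three candidate sounds
matching_prior = np.array([0.2, 0.8, 0.1])       # prior expectation that matches the input
neutral_prior = np.array([1 / 3, 1 / 3, 1 / 3])  # uninformative prior

def sharpened(signal, prior):
    """Sharpening: expected features are enhanced (gain proportional to the prior), then renormalized."""
    rep = signal * prior
    return rep / rep.sum()

def prediction_error(signal, prior):
    """Predictive coding: only the deviation between input and prior is represented."""
    return signal - prior

for name, prior in [("matching prior", matching_prior), ("neutral prior", neutral_prior)]:
    print(name,
          "| sharpened:", np.round(sharpened(input_features, prior), 3),
          "| prediction error:", np.round(prediction_error(input_features, prior), 3))
```

Under the matching prior, the sharpened representation becomes more distinct while the prediction error representation shrinks toward zero, which captures, in a highly simplified form, why the two accounts make different predictions about informative pattern information.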
In addition to the laboratory-induced occurrences of speech misperception that we have studied here, prediction error representations have the potential to explain more ecologically and clinically significant instances of misperception. For example, in naturally occurring slips of the ear, listeners typically report incorrect but phonologically and lexically well-formed content words while adding or modifying function words to generate plausibly structured phrases and sentences (Bond, 2005). Therefore, real-world misperception of speech involves both sensory confusions (i.e., content words are misidentified) and the filling in of predicted words. These observations seem to follow naturally from an account in which misperception derives from weak representations of prediction error. Older individuals have a double vulnerability to speech misperception: age-related hearing loss is the most common sensory impairment in old age (Roth et al., 2011) and, even when intelligibility is equated, older listeners are more likely than younger listeners to report a predictable but incorrect word (Rogers and Wingfield, 2015). This is consistent with a novel proposal derived from the current study: that impaired sensory processing in older listeners leads to a systematic reduction in the strength or efficacy of prediction error representations.
Our account of misperception based on inadequate prediction error representations is also relevant to abnormal perceptual experience arising from overapplication of prior beliefs about the world in the absence of corresponding sensory input (i.e., hallucinations). Inappropriate integration of prior expectations could lead to verbal hallucinations ranging from voice hearing in individuals without any clinical diagnosis (Alderson-Day et al., 2017) to the more distressing experiences reported by individuals with schizophrenia (Fletcher and Frith, 2009). Recent work has shown that individuals with early psychosis and healthy individuals at risk of psychosis show a greater reliance on prior knowledge during perception of visually degraded images (Teufel et al., 2015). Our observations of the neural representations that underpin prior knowledge-induced misperceptions of speech may therefore assist in exploring the origins of auditory–verbal hallucinations in psychosis.
The present findings show that representations of prediction error determine perceptual outcomes in listening conditions that lead to frequent misperceptions. Most descriptive theories explain illusory perception as arising from sensory representations of features or sounds that are supported by prior expectations (Gregory, 1997). Our work instead provides support for a complementary proposal; namely that misperception occurs when there is an insufficient sensory representation of the difference between expectations and sensory signals. Sensory prostheses or other neural interventions (Moore and Shannon, 2009; Zoefel and Davis, 2017) that enhance representations of prediction error may thereby improve the accuracy of speech perception in hearing-impaired individuals.
Footnotes
This work was supported by the UK Medical Research Council (Grant RG91365/SUAG/008 to M.H.D.) and the European Union Horizon 2020 Programme (Marie Curie Fellowship 703635 to H.B.). We thank Yaara Erez, Jenni Rodd, Ediz Sohoglu, and Arnold Ziesche for valuable comments on a previous version of this manuscript and Helen Lloyd and Steve Eldridge for assistance in radiography.
The authors declare no competing financial interests.
- Correspondence should be addressed to Helen Blank, Department of Systems Neuroscience, University Medical Center Hamburg-Eppendorf, Martinistr. 52, 20248 Hamburg, Germany. hblank@uke.de