Abstract
When an acoustic signal is temporarily interrupted by another sound, it is sometimes heard as continuing through, even when the signal is actually turned off during the interruption—an effect known as the “auditory continuity illusion.” A widespread view is that the illusion can only occur when peripheral neural responses contain no evidence that the signal was interrupted. Here we challenge this view using a combination of psychophysical measures from human listeners and computational simulations with a model of the auditory periphery. The results reveal that the illusion seems to depend more on the overall specific loudness than on the peripheral masking properties of the interrupting sound. This finding indicates that the continuity illusion is determined by the global features, rather than the fine-grained temporal structure, of the interrupting sound, and argues against the view that the illusion arises in the auditory periphery.
Introduction
When a sound is briefly occluded by another, we usually hear it continuing through the occluder. This perceptual continuity makes it easier for the listener to follow sounds of interest (e.g., a voice or musical instrument) despite occasional interference by extraneous sounds (e.g., another voice or background noise). Remarkably, a sound can be heard as continuous even if it is actually turned off while the occluder is present, giving rise to a phenomenon known as the auditory continuity illusion (Miller and Licklider, 1950).
The neural basis of this illusion and the conditions required for its occurrence are still being explored. According to a widespread view, the continuity illusion can occur only if peripheral neural responses provide no sensory evidence that the target (T) sound was absent during the occlusion. This implies that neural excitation produced by the occluder in peripheral auditory “channels” [e.g., individual fibers within the auditory nerve (AN)] must completely mask the excitation that would have been evoked by the target, had the target actually been present (Warren et al., 1972; Bregman, 1990, p. 352). This “peripheral masking” theory (for review, see Bregman, 1990; Warren, 1999; Petkov and Sutter, 2011) is supported by the results of several psychophysical studies, which suggest that the illusion occurs only if the occluder is loud enough to render the target inaudible, had the target actually been present in the occluder (Warren et al., 1972; Warren et al., 1988; Kluender and Jenison, 1992; Riecke et al., 2008).
Although a large number of experimental findings appear to support the peripheral masking theory (Houtgast, 1972; Duifhuis, 1980; Plack and Oxenham, 2000), the interpretation of these findings has focused almost exclusively on the long-term (average) power spectrum of the target and the masker (M): less attention has been paid to how the fine-grained temporal characteristics of these sounds influence the illusion. As a result, the nature of neural representations underlying the continuity illusion remains poorly understood. In particular, it is unclear whether the illusion arises from temporally integrated neural representations of the target and the occluder (e.g., average spike rates, which are encoded by central auditory neurons) or from a more fine-grained neural representation of the sound waveform (e.g., the precise spike timing of peripheral auditory neurons).
In this study, we took advantage of the fact that two sounds with identical long-term power spectra, but different phase spectra, can elicit very different temporal response patterns in the peripheral auditory system to shed new light on the nature of representations underlying the continuity illusion, and to provide a strong test of the peripheral masking theory.
Materials and Methods
Stimuli and tasks.
The target sound was a 1500 Hz pure tone, and the masker was a complex harmonic tone consisting of the first 33 harmonics of a 100 Hz fundamental frequency (Fig. 1A). Two different masker conditions (denoted M+ and M−) were produced by manipulating the phase spectrum of the masker according to an equation proposed by Schroeder (1970). The starting phase of the nth harmonic (n = 1, …,33), ϕ(n), was set to: where d denotes a random phase shift drawn from a uniform (−π;π] distribution on each stimulus presentation, and the sign of m depends on the condition tested (m = +1 for the M+ condition, and m = −1 for the M− condition). The waveforms of the resulting sounds resemble a periodic series of downward or upward linear frequency glides (for M+ and M−, respectively) (Fig. 1C,E).
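The masker construction described above can be sketched in a few lines of NumPy. The phase expression itself is not reproduced in the text, so the sketch assumes a commonly used Schroeder-type rule, ϕ(n) = d + mπn(n − 1)/N with N = 33; the variant with n(n + 1) in place of n(n − 1) differs only by a time shift of the waveform. The function name, its arguments, the equal harmonic amplitudes, and the 1/N scaling are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def schroeder_complex(m, d=None, f0=100.0, n_harmonics=33,
                      dur=0.2, fs=44100, rng=None):
    """Equal-amplitude harmonic complex with Schroeder-type phases.

    Assumed phase rule (not taken from the source text):
        phi(n) = d + m * pi * n * (n - 1) / n_harmonics
    with m = +1 for M+ and m = -1 for M-.
    """
    if d is None:
        rng = np.random.default_rng() if rng is None else rng
        # uniform() samples [-pi, pi); negating maps this onto (-pi, pi]
        d = -rng.uniform(-np.pi, np.pi)
    t = np.arange(int(round(dur * fs))) / fs
    n = np.arange(1, n_harmonics + 1)
    phi = d + m * np.pi * n * (n - 1) / n_harmonics
    # One row per harmonic, summed across harmonics
    x = np.sin(2 * np.pi * f0 * n[:, None] * t[None, :] + phi[:, None])
    return x.sum(axis=0) / n_harmonics

# M+ and M- share the same magnitude spectrum but differ in waveform
m_plus = schroeder_complex(+1, d=0.0)
m_minus = schroeder_complex(-1, d=0.0)
```

Because only the sign of the phase curvature changes, the two signals have identical long-term power spectra, which is the property the experimental design relies on.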
An important property of these sounds is that, despite having identical long-term power spectra, they evoke very different neural response patterns in the peripheral auditory system due to phase dispersion in the cochlea (Recio and Rhode, 2000). Specifically, the envelopes of basilar membrane (BM) responses to M+ (as simulated using a physiologically inspired computer model; see Peripheral auditory model) are strongly modulated or “peaky” (Fig. 1C), whereas those to M− are considerably “flatter” (Fig. 1E). This difference can be explained by the fact that the positive phase curvature of M+ compensates for the negative phase curvature of cochlear filtering (Recio and Rhode, 2000; Summers et al., 2003), so that cochlear responses to the harmonics of this complex are more in phase with each other. In contrast, the negative phase curvature of M− adds to the negative phase curvature of the cochlear filters, resulting in an accentuated phase dispersion effect. These differences in cochlear responses to M+ and M− lead to clear differences in the spiking patterns of AN fibers. As illustrated in Figure 1, C and E, the simulated peripheral response pattern to M+ is more peaky than that to M−, with pronounced temporal gaps within each period.
The marked differences in peripheral response to M+ and M− can explain why the two sounds have very different masking properties: the masked threshold of a pure tone embedded in M− can be ≥20 dB higher than its threshold within an M+ complex with the same overall physical level (Smith et al., 1986; Kohlrausch and Sander, 1995; Lentz and Leek, 2001; Oxenham and Dau, 2001b). This effect can be understood by noting that when the tone target is added to M+, the resulting neural response pattern is markedly different from that produced by M+ alone: the tone can easily be detected between the sharp peaks elicited by M+ (Fig. 1D). By contrast, adding the tone to M− induces changes in response pattern that are much subtler (Fig. 1F) (Rupp et al., 2008).
Based on these observations, we made the following predictions: if the generation of the continuity illusion depends on a fine-grained analysis of peripheral neural responses by the central auditory system, then the level of the target below which the continuity illusion occurs (the continuity–illusion threshold) should be considerably lower with the M+ masker than with the M− masker, in line with the differences in masked threshold. Conversely, if the continuity illusion depends on a relatively coarse representation of the masker (e.g., a representation of its overall level or specific loudness), the illusion thresholds measured with the M+ and M− maskers should be reasonably similar (i.e., their difference should be considerably smaller than the differences in the masked threshold).
To test these predictions, we conducted three experiments, each comprising the two masker conditions (M+ and M−). In the continuity–illusion (C) experiment, the T and the M were 200 ms each, and were played in an alternating sequence: TMTMTMT (Fig. 1A). The task of the listeners was to indicate whether they heard the target “continue through” the masker or not. The level of the target was varied adaptively to determine the continuity–illusion threshold (see Procedure).
In the “simultaneous-masking” (S) experiment, the 200 ms target was presented simultaneously with, and temporally centered in, the 600 ms masker in one of two consecutive observation intervals (selected at random with equal probability), while the other observation interval contained only the masker. The task of the listeners was to indicate which of the two observation intervals contained the target. The aim of this experiment was to provide a direct measure of the amount of simultaneous masking produced by the masker.
In the “forward-masking” (F) experiment, the 10 ms target was presented 5 ms after the 200 ms masker in one observation interval (first or second, at random with equal probability), while the other interval contained only the masker. The task of the listeners was to indicate which of the two observation intervals contained the target. The aim of this experiment was to provide an estimate of the strength of the overall neural response pattern (i.e., the long-term internal excitation) produced by the masker (Carlyon and Datta, 1997).
All sounds (target and masker) were ramped on and off with 20 ms linear amplitude ramps, except for the F experiment, in which the 10 ms target was ramped with 5 ms ramps (no steady state). In the C experiment, the amplitude midpoints of the ramps of consecutive elements (T and M) were made to coincide to reduce the potential audibility of gaps in the TMTMTMT sequence. The two observation intervals in the S and F experiments were separated by a 500 ms silent gap.
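One way to realize the ramp alignment described above is to overlap each offset ramp completely with the following onset ramp, so that the two linear ramps cross at their amplitude midpoints. The sketch below makes that assumption, and further assumes that the 200 ms element duration includes the ramps and that target and masker have equal length; the function names are hypothetical.

```python
import numpy as np

FS = 44100  # sampling rate in Hz

def ramped(x, ramp_ms=20.0, fs=FS):
    """Return a copy of x with linear onset and offset ramps applied."""
    n = int(round(ramp_ms * 1e-3 * fs))
    # Half-sample offset makes overlapping on/off ramps sum to exactly 1
    ramp = (np.arange(n) + 0.5) / n
    env = np.ones(len(x))
    env[:n] = ramp
    env[-n:] = ramp[::-1]
    return x * env

def alternate(target, masker, n_elements=7, ramp_ms=20.0, fs=FS):
    """Overlap-add T, M, T, ... so consecutive ramps cross at their midpoints.

    Assumes target and masker contain the same number of samples.
    """
    n_ramp = int(round(ramp_ms * 1e-3 * fs))
    hop = len(target) - n_ramp  # onset-to-onset spacing in samples
    out = np.zeros(hop * (n_elements - 1) + len(target))
    for i in range(n_elements):
        element = ramped(target if i % 2 == 0 else masker, ramp_ms, fs)
        out[i * hop:i * hop + len(element)] += element
    return out
```

With the default of seven elements, the function returns a TMTMTMT sequence in which the summed amplitude envelopes stay constant across each transition.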
Procedure.
Psychophysical thresholds were measured by varying the level of the target. The level of the masker was kept constant throughout at 50 dB SL per harmonic (expressed relative to the listener's hearing threshold at the frequency of the target). In experiment C, each trial involved a single presentation of the TMTMTMT sequence. Continuity–illusion thresholds were measured using a one-up one-down rule tracking the level at which the target was judged continuous 50% of the time. Thresholds in experiments S and F were measured with an adaptive two-alternative forced-choice procedure and a two-down one-up tracking rule estimating the 70.7% correct point on the psychometric function (Levitt, 1971).
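The convergence points of the two tracking rules follow from a short equilibrium argument (Levitt, 1971). Let p denote the probability of the response that counts toward a decrease in target level (a symbol introduced here for illustration). A track converges on the level at which a downward step is as likely as an upward step:

```latex
\begin{align}
  \text{one-up one-down:}\quad & p = 1 - p
      &&\Rightarrow\; p = 0.5,\\
  \text{two-down one-up:}\quad & p^{2} = 1 - p^{2}
      &&\Rightarrow\; p = \sqrt{0.5} \approx 0.707 .
\end{align}
```

In the two-down one-up case, p² is the probability of two consecutive correct responses, and 1 − p² covers the two sequences (incorrect; correct then incorrect) that lead to an increase in level.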
The amount by which the target level was changed within the adaptive tracking procedure was initially set to 6 dB; it was reduced to 3 dB after the second reversal in the direction of change of the target level (from increasing to decreasing or vice versa), and to 0.5 dB after the fourth reversal. Each measurement began with the target level set sufficiently high (62 dB SL) so that the listener did not hear the target continue through the masker in experiment C, and that the listener could easily detect the target in experiments S and F. The procedure terminated after the 10th reversal. Threshold was computed as the arithmetic mean of the target levels at the last six reversals. For each listener and each masker condition (M+ and M−), nine, six, and six thresholds were measured in experiments C, S, and F, respectively, in fully randomized order.
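The tracking procedure can be summarized as a minimal sketch. The function and the `respond` callback are hypothetical and stand in for the listener; the sketch assumes that the step size changes immediately at the second and fourth reversals, which the description above does not specify in detail.

```python
import numpy as np

def run_staircase(respond, n_down=2, start_level=62.0,
                  steps=(6.0, 3.0, 0.5), max_reversals=10, n_average=6):
    """Adaptive track returning the mean level at the last reversals.

    respond(level) -> True for a response that counts toward a decrease
    in level (correct detection in S and F; "discontinuous" in C).
    Use n_down=2 for two-down one-up and n_down=1 for one-up one-down.
    """
    level, streak, direction = start_level, 0, 0
    reversal_levels = []
    while len(reversal_levels) < max_reversals:
        if respond(level):
            streak += 1
            if streak < n_down:
                continue  # same level again until n_down in a row
            streak, new_direction = 0, -1
        else:
            streak, new_direction = 0, +1
        # A reversal is a change in the direction of the level track
        if direction != 0 and new_direction != direction:
            reversal_levels.append(level)
        direction = new_direction
        n_rev = len(reversal_levels)
        step = steps[0] if n_rev < 2 else steps[1] if n_rev < 4 else steps[2]
        level += new_direction * step
    return float(np.mean(reversal_levels[-n_average:]))

# Idealized listener with a sharp criterion at 30 dB SL (illustration only)
threshold = run_staircase(lambda level: level > 30.0, n_down=1)
```

Averaging only the last six of ten reversals restricts the estimate to the portion of the track run at the smallest step size.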
Before the measurements, the listeners were provided with written instructions, and they listened to example stimuli in which the target was clearly present or absent (for experiments S and F), and clearly discontinuous or continuous through the maskers (for experiment C). For experiment C, listeners were instructed to ignore the interrupting sound as much as possible and focus on the target to judge whether it was continuous or discontinuous. They practiced several runs of the task before the first measurement. In experiment C, no feedback was provided to the listener. In experiments S and F, visual feedback was provided after each trial.
Apparatus.
Stimuli were generated digitally and played via a soundcard (ALC888, Realtek) with 16 bit resolution and 44.1 kHz sampling frequency. The stimuli were presented monaurally to the listener's left ear through headphones (MDR-V900HD, Sony) in a sound-attenuating chamber. Monaural stimulation was used to avoid potential binaural interactions and facilitate interpretation of the results; the left ear was chosen arbitrarily. Responses were collected via two buttons. Stimulus presentation and response collection were controlled using the AFC software package (developed by Stephan Ewert, University of Oldenburg, Oldenburg, Germany, and Technical University of Denmark, Copenhagen, Denmark) under Matlab.
Participants.
Eleven paid volunteers (five females; age range, 20–35 years; mean age, 25.3 years) participated in the study after providing written informed consent. They had normal hearing [defined as pure-tone hearing thresholds of <20 dB hearing level (HL) at 125, 250, 500, 1000, 1500, 2000, and 4000 Hz], except for one participant who had a slightly elevated threshold (35 dB HL) at 4000 Hz, and no history of hearing disorders. Two of the 11 participants were author L.R. and a research assistant. Another two participants were tested using a higher masker level (+10 dB/harmonic). Excluding the data of these four participants from the analyses did not alter the conclusions of the study. The local ethics committee (Ethische Commissie Psychologie at Maastricht University) approved the study procedures.
Statistical analysis.
Differential masker effects (Δ) were computed by subtracting the averaged thresholds in the M+ condition from those in the M− condition. The distributions of listeners' mean thresholds and Δ values did not diverge significantly from normality as assessed by Kolmogorov–Smirnov tests, and no outliers (values beyond 3 SDs from the mean) were detected; therefore, statistical analyses were conducted with parametric tests for repeated measures. Mauchly's test indicated that the assumption of sphericity was violated for the interaction term (masker × experiment: χ2 = 10.97, p = 0.0041) and one factor (experiment, in M− condition: χ2 = 6.61, p = 0.037); therefore, degrees of freedom were corrected using Greenhouse–Geisser estimates of sphericity (ε = 0.59 and ε = 0.66, respectively). Reported probability values were corrected for multiple comparisons such that the false discovery rate was controlled at a level of 0.05.
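The computation of Δ and the false-discovery-rate control can be illustrated as follows. The text does not name the FDR procedure; the sketch assumes the Benjamini–Hochberg step-up rule, and both function names and the array layout (listeners × repeated thresholds) are hypothetical.

```python
import numpy as np

def differential_masker_effect(thr_minus, thr_plus):
    """Delta per listener: mean M- threshold minus mean M+ threshold (dB).

    Both inputs have shape (n_listeners, n_repeats).
    """
    return np.mean(thr_minus, axis=-1) - np.mean(thr_plus, axis=-1)

def benjamini_hochberg(pvals, q=0.05):
    """Boolean mask of tests declared significant at FDR level q.

    Assumed procedure (Benjamini-Hochberg step-up), not confirmed
    by the source text.
    """
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Compare the i-th smallest p-value with q * i / m
    below = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest rank meeting the criterion
        reject[order[:k + 1]] = True
    return reject
```

A positive Δ under this definition means that thresholds were higher with the M− masker than with the M+ masker.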
Peripheral auditory model.
BM and AN responses were simulated using a computational model of the auditory periphery, which involved the following steps. First, bandpass filtering was used to simulate attenuation by the external and middle ears (Glasberg and Moore, 2006). Second, a bank of gammachirp filters (Irino and Patterson, 1997) with equidistant characteristic frequencies (on the equivalent rectangular bandwidth scale) and phase response altered to better simulate the phase curvature of auditory filters (Oxenham and Dau, 2001a) was used to simulate cochlear filtering. Third, cochlear compression was simulated by applying amplitude-envelope compression to the gammachirp filter output (Sachs et al., 1989). Finally, the compressed BM displacement was transformed into instantaneous spike rate, taking into account the degree of synchronization (phase locking) (Colburn, 1973) and rate-level functions of cat AN fibers (Sachs et al., 1989). The simulated sound pressure levels of target and masker were set to match those in the M+ condition of the simultaneous-masking experiment (corresponding to listeners' average threshold) (Fig. 2A). The simulated responses are shown for filters whose center frequency equaled the frequency of the target (Fig. 1B–F).
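As an illustration of the second step only, characteristic frequencies that are equidistant on the equivalent rectangular bandwidth scale can be obtained with the ERB-number formula of Glasberg and Moore (1990). This formula is assumed here; the text does not state which ERB-scale formula, frequency range, or number of channels the model used, so the arguments below are placeholders, and the gammachirp filters themselves are not implemented.

```python
import numpy as np

def hz_to_erb_number(f_hz):
    """ERB-number (Cams) for frequency in Hz (Glasberg and Moore, 1990)."""
    return 21.4 * np.log10(0.00437 * np.asarray(f_hz, dtype=float) + 1.0)

def erb_number_to_hz(e):
    """Inverse of hz_to_erb_number."""
    return (10.0 ** (np.asarray(e, dtype=float) / 21.4) - 1.0) / 0.00437

def erb_spaced_cfs(f_lo, f_hi, n_channels):
    """Characteristic frequencies equally spaced on the ERB-number scale."""
    e = np.linspace(hz_to_erb_number(f_lo), hz_to_erb_number(f_hi), n_channels)
    return erb_number_to_hz(e)

# Placeholder range and channel count, chosen only for illustration
cfs = erb_spaced_cfs(100.0, 3300.0, 30)
```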
Results
Figure 2A shows mean thresholds from the three experiments. Thresholds were lower in conditions involving the M+ masker than in conditions involving the M− masker, both overall (F(1,10) = 44.05, p < 0.001, ηp2 = 0.82) and for each experiment taken separately (experiment C: t(10) = 1.94, p = 0.049; experiment S: t(10) = 15.07, p < 0.001; experiment F: t(10) = 4.05, p = 0.0018). For conditions involving the M+ masker, thresholds were higher in experiment C than in experiment S (t(10) = 3.29, p = 0.011), and were higher in experiment F than in experiment S (t(10) = 8.55, p < 0.001). There were no significant differences between experiments for conditions involving the M− masker (F(1.32,13.16) = 0.014, p = 0.95, ηp2 = 0.001).
Importantly, the overall masker effect (i.e., Δ) depended on the nature of the experiment (masker × experiment interaction: F(1.17,11.73) = 20.25, p < 0.001, ηp2 = 0.67). As illustrated in Figure 2B, Δ was significantly smaller for experiment C than experiment S (mean ± SEM: −12 ± 2 dB; t(10) = −7.34, p < 0.001), and was significantly smaller for experiment F than experiment S (t(10) = −6.78, p < 0.001), but did not differ significantly between experiments C and F (t(10) = −0.97, p = 0.36).
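The reported effect sizes can be checked against the F statistics with the standard identity relating partial eta squared to F and its degrees of freedom. The check below uses only values reported in this section.

```latex
\begin{align}
  \eta_p^{2} &= \frac{F \cdot df_{1}}{F \cdot df_{1} + df_{2}},\\[4pt]
  \text{masker:}\quad
    \eta_p^{2} &= \frac{44.05 \times 1}{44.05 \times 1 + 10} \approx 0.82,\\[4pt]
  \text{masker} \times \text{experiment:}\quad
    \eta_p^{2} &= \frac{20.25 \times 1.17}{20.25 \times 1.17 + 11.73} \approx 0.67 .
\end{align}
```

The Greenhouse–Geisser correction scales both degrees of freedom by the same factor ε, so it leaves the effect size unchanged.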
Discussion
Our first finding is that the auditory continuity illusion can occur under conditions where peripheral neural responses contain clear evidence that the target was interrupted: simultaneously masked thresholds for the target were much lower for the M+ masker than for the M− masker (i.e., the tone was easier to detect in the M+ masker), but this difference was reflected less clearly in thresholds for the continuity illusion. The apparent failure of the auditory system to take account of the fine-grained temporal structure of the masker occurred despite the fact that the target was turned off for ∼200 ms—much longer than the duration needed for listeners to detect the presence or absence of the target in silence. This finding casts doubt on the long-standing theory that overlapping peripheral patterns are the primary neural basis for the continuity illusion (Warren et al., 1972).
Our second finding is that thresholds for the continuity illusion follow the pattern found with forward-masked thresholds. Previous studies have suggested that forward-masked thresholds reflect the long-term (average) internal excitation produced by the masker (Carlyon and Datta, 1997; Wojtczak and Oxenham, 2009). This suggests that the continuity illusion depends on a neural representation of the average excitation produced by the occluder—possibly related to the specific loudness of the occluder (Mauermann and Hohmann, 2007)—rather than on a fine-grained temporal analysis of the excitation, as was previously assumed.
Only a few previous studies have reported continuity illusions that seemed to contradict the peripheral overlap theory. In one of these studies, the spectra of the occluder and the target tone were separated by >1.5 octaves (Sugita, 1997), a condition that would not be expected to elicit the illusion (Warren et al., 1972; Micheyl et al., 2003; Petkov et al., 2003; Riecke et al., 2008). In two other studies, the target was not interrupted by any sound (Remijn et al., 2007, 2008). In another study, the occluder was interrupted for only a small fraction of the overall stimulus duration (Haywood et al., 2011). Even fewer studies have investigated dynamic aspects of the illusion (Warren et al., 1988): Two previous studies found that a frequency-modulated pure tone, which was partially replaced by noise, was perceived as continuing through the noise, but that listeners were unable to retain the phase of the modulation during the interruption—an outcome that suggests that the fine-grained temporal structure of the target sound is not preserved during the illusion (Carlyon et al., 2004; Lyzenga et al., 2005). This finding is consistent with the results of the present study, which indicate that the continuity illusion reflects a failure of the auditory system to analyze the fine-grained temporal structure of the peripheral neural responses to sound. Another study reported continuity illusions of a train of brief impulses (clicks) through a noise, a tone, or another click train (Thurlow and Erchul, 1978). However, the interpretation of the results remained tentative, as important controls of peripheral neural activity and masking were not obtained. The present study combined these controls with more rigorous methods, including the use of complex maskers that are well suited for elucidating peripheral mechanisms for auditory scene analysis (Summers and Leek, 1998).
The fact that the continuity illusion seems to depend primarily on long-term features of the occluder (e.g., its specific loudness), rather than its fine-grained temporal features, suggests that mechanisms underlying the illusion must have relatively low temporal resolution. Therefore, the illusion is likely determined after the temporal integration of rapidly fluctuating AN patterns, an interpretation that supports previous claims that the illusion originates at relatively central processing stages (Elfner and Homick, 1967; Schreiner, 1980; Hellstrom and Young, 1989; Petkov et al., 2007).
Footnotes
L.R. is supported by the Netherlands Organization for Scientific Research (Veni Grant 451.11.014). A.J.O. and C.M. are supported by NIH Grants R01 DC07657 and R01 DC05216. We thank Gesine Malschofsky for help with data acquisition and two reviewers for useful comments.
The authors declare no competing financial interests.
Correspondence should be addressed to Lars Riecke, Department of Cognitive Neuroscience, Faculty of Psychology and Neuroscience, Maastricht University, P.O. Box 616, 6200 MD Maastricht, The Netherlands. L.Riecke@MaastrichtUniversity.nl