Analysis of speech segments. A, Variation of sound pressure level over time for a representative utterance from the TIMIT corpus (the sentence in this example is “She had your dark suit in greasy wash water all year”). B, Blowup of a 0.1 sec segment extracted from the utterance (in this example, the vowel sound in “dark”). C, The spectrum of the extracted segment in B, generated by application of a fast Fourier transform.
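The step from panel B to panel C (a 0.1 sec segment transformed into an amplitude spectrum) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' processing code: the function name `segment_spectrum`, the 16 kHz sampling rate, the Hann window, and the synthetic test signal are all assumptions introduced here.

```python
import numpy as np

def segment_spectrum(signal, sample_rate, start_s, duration_s=0.1):
    """Return (frequencies in Hz, amplitudes) for one short segment.

    Hypothetical helper: the Hann window is an assumption, since the
    caption only states that a fast Fourier transform was applied.
    """
    start = int(round(start_s * sample_rate))
    stop = start + int(round(duration_s * sample_rate))
    segment = np.asarray(signal[start:stop], dtype=float)
    window = np.hanning(len(segment))          # taper to reduce leakage
    amplitudes = np.abs(np.fft.rfft(segment * window))
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / sample_rate)
    return freqs, amplitudes

# Synthetic stand-in for a voiced sound: a 120 Hz fundamental plus four
# weaker harmonics (not real TIMIT data).
sample_rate = 16000                            # assumed sampling rate
t = np.arange(sample_rate) / sample_rate       # one second of samples
synthetic = sum(np.sin(2 * np.pi * 120 * k * t) / k for k in range(1, 6))

freqs, amplitudes = segment_spectrum(synthetic, sample_rate, start_s=0.3)
peak_hz = freqs[np.argmax(amplitudes)]
print(f"strongest component near {peak_hz:.0f} Hz")
```

With a 0.1 sec segment the frequency resolution of the transform is 10 Hz, which is why a longer or shorter window would change how sharply individual harmonics are resolved.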
Statistical characteristics of spoken American English based on an analysis of the spectra extracted from the >100,000 segments (200 per speaker) in the TIMIT corpus. Mean normalized amplitude is plotted as a function of normalized frequency, the maxima indicating the normalized frequencies at which power tends to be concentrated. A, The normalized probability distribution of amplitude-frequency combinations for the frequency ratio range 1-8. B, Mean normalized amplitude plotted as a function of normalized frequency over the same range. C, Blowup of the plot in B for the octave interval bounded by the frequency ratios 1 and 2. Error bars show the 95% confidence interval of the mean at each local maximum. D, The plot in C shown separately for male (blue) and female (red) speakers.
Statistical structure of speech sounds in Farsi, Mandarin Chinese, and Tamil, plotted as in Figure 2 (American English is included for comparison). The functions differ somewhat in average amplitude but are remarkably similar both in the frequency ratios at which amplitude peaks occur and in the relative heights of these peaks.
Consonance rankings predicted from the normalized spectrum of speech sounds. A, Median consonance rank of musical intervals (from Fig. 6) plotted against the residual mean normalized amplitude at different frequency ratios. B, Median consonance rank plotted against the average slope of each local maximum. By either index, consonance rank decreases progressively as the relative concentration of power at the corresponding maxima in the normalized speech sound spectrum decreases.
Probability distribution of the harmonic number at which the maximum amplitude occurs in speech sound spectra derived from the TIMIT corpus. A, The distribution for the first 10 harmonics of the fundamental frequency of each spectrum. More than 75% of the amplitude maxima occur at harmonic numbers 2-5. B, The frequency ratio values at which power concentrations are expected within the frequency ratio range 1-2 (Fig. 2C) when the maximum amplitude in the spectrum of a periodic signal occurs at different harmonic numbers. There are no peaks in Figure 2 at intervals corresponding to the reciprocals of integers >6, reflecting the paucity of amplitude maxima at harmonic numbers >6 (A). See Materials and Methods for further explanation.
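The arithmetic behind panel B can be sketched as follows. The sketch assumes that each spectrum's frequencies are normalized by dividing by the frequency of the harmonic carrying the maximum amplitude, so that harmonic k of a periodic signal lands at the ratio k/n when the maximum falls at harmonic n; that normalization rule and the helper name `expected_ratios` are assumptions made here for illustration, and Materials and Methods remains the authoritative description.

```python
from fractions import Fraction

def expected_ratios(max_harmonic):
    """Frequency ratios in [1, 2] where harmonics of a periodic signal fall
    when frequencies are divided by the frequency of harmonic `max_harmonic`.

    Assumed rule: harmonic k maps to k / max_harmonic.
    """
    n = max_harmonic
    return [Fraction(k, n) for k in range(n, 2 * n + 1)]

# Ratios expected for amplitude maxima at harmonic numbers 1 through 6.
for n in range(1, 7):
    print(n, [str(r) for r in expected_ratios(n)])
```

Under this assumption a maximum at harmonic 2 contributes the ratio 3/2, a maximum at harmonic 3 contributes 4/3 and 5/3, and so on, so ratios with denominators above 6 would require amplitude maxima at harmonic numbers that panel A shows to be rare.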
Comparison of the normalized spectrum of human speech sounds and the intervals of the chromatic scale. A, The majority of the musical intervals of the chromatic scale (arrows) correspond to the mean amplitude peaks in the normalized spectrum of human speech sounds, shown here over a single octave (Fig. 2C). The names of the musical intervals and the frequency ratios corresponding to each peak are indicated. B, A portion of a piano keyboard indicating the chromatic scale tones over one octave, their names, and their frequency ratios with respect to the tonic in the three major tuning systems that have been used in Western music. The frequency ratios at the local maxima in A closely match the frequency ratios that define the chromatic scale intervals.
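To give a sense of how closely different tuning systems agree on the frequency ratio of a given interval, the sketch below compares twelve-tone equal temperament with a few commonly cited small-integer (just) ratios. The caption does not name the three tuning systems, so the choice of these two systems, the selection of intervals, and the names `cents`, `equal_tempered_ratio`, and `JUST_RATIOS` are assumptions made purely for illustration.

```python
import math

def cents(ratio):
    """Size of a frequency ratio in cents (1200 cents per octave)."""
    return 1200.0 * math.log2(ratio)

def equal_tempered_ratio(semitones):
    """Frequency ratio of an interval in twelve-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)

# Interval name -> (semitones above the tonic, commonly cited just ratio).
# Illustrative subset only; not taken from the figure itself.
JUST_RATIOS = {
    "minor third": (3, 6 / 5),
    "major third": (4, 5 / 4),
    "perfect fourth": (5, 4 / 3),
    "perfect fifth": (7, 3 / 2),
    "major sixth": (9, 5 / 3),
    "octave": (12, 2 / 1),
}

# Positive values mean the equal-tempered interval is wider than the just one.
deviation_cents = {
    name: cents(equal_tempered_ratio(semitones)) - cents(just)
    for name, (semitones, just) in JUST_RATIOS.items()
}

for name, dev in deviation_cents.items():
    print(f"{name:>14}: {dev:+.2f} cents")
```

The differences amount to a small fraction of a semitone (100 cents), which is consistent with the caption's point that the tuning systems define nearly the same set of frequency ratios.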
Consonance ranking of chromatic scale tone combinations (dyads) in the seven psychophysical studies reported by Malmberg (1918), Faist (1897), Meinong and Witasek (1897), Stumpf (1898), Buch (1900), Pear (1911), and Kreuger (1913). The graph shows the consonance rank assigned to each of the 12 chromatic dyads in the various studies reported. The median values are indicated by open circles connected by a dashed line.