are generally two steps involved in speech analysis: spectral shaping and spectral analysis.
Each of these processing stages will be discussed in turn in the next section.
The purpose of spectral shaping is to transform the basic speech data, which is a continu-
ous time series of acoustic waveforms, into a discretized form for subsequent digital analysis.
A model of speech production itself may be included in this processing stage, though most approaches do not incorporate a direct model of the speech production process (though see Rahman & Shimamura, 2006 for a detailed example). If such a model is included, the generators of speech themselves must be incorporated into the processing pipeline. The glottis, vocal cords, trachea, and lips are all involved in the production of the waveforms associated with speech. Each anatomical element acts, in a sense, as a filter, modulating the output (waveforms) from the previous element in the chain. The estimation of the glottal waveforms from
the speech waveforms is a very computationally intense task (Rahman & Shimamura, 2006).
Therefore, most studies forgo this process, which may yield a slight reduction in classification
performance (see Shao et al., 2007 for a quantitative estimate of the performance
degradation).
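The source–filter view sketched above can be made concrete with a short numerical example. The sketch below (a minimal illustration, not drawn from the chapter) synthesizes a crude vowel-like waveform by passing a glottal pulse train through a digital all-pole "vocal tract" filter; the pitch, formant frequencies, and bandwidths are illustrative assumptions only.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000        # sampling rate (Hz)
f0 = 120          # assumed glottal pitch (Hz)
duration = 0.5    # seconds of synthetic speech

# Source: an impulse train standing in for the glottal excitation.
n = int(fs * duration)
source = np.zeros(n)
source[::fs // f0] = 1.0

# Filter: an all-pole "vocal tract" built from two illustrative
# formant resonances, each specified as (frequency, bandwidth) in Hz.
a = np.array([1.0])
for freq, bw in [(700, 130), (1220, 70)]:        # rough /a/-like formants (assumed)
    r = np.exp(-np.pi * bw / fs)                 # pole radius from bandwidth
    theta = 2 * np.pi * freq / fs                # pole angle from frequency
    a = np.convolve(a, [1.0, -2 * r * np.cos(theta), r * r])

# Passing the source through the filter yields the modeled speech waveform.
speech = lfilter([1.0], a, source)
```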
In spectral analysis, the speech time series is analyzed at short intervals, typically on the order of 10–30 ms. A pre-emphasis filter may be applied to the data to compensate for the decrease in spectral energy that occurs at higher frequencies: the intensity of speech sound is not uniform across frequency, but drops at approximately 6 dB per octave, and the purpose of pre-emphasis is to counteract this effect (Bimbot et al., 2004; Rosell, 2006). Typically, a first-order finite impulse response (FIR) filter is used for this purpose (Xafopoulos, 2001; Rangayyan, 2002). After any pre-emphasis processing, the signal is typically converted into
a series of frames. The frame length is typically taken to be 20–30 ms, which reflects physiological constraints of sound production: a frame of this length spans a few glottal periods. The overlap between frames is such that their centers are typically only 10 ms apart. This yields a series of signal frames, each representing 20–30 ms of real time, which corresponds to roughly 320–480 samples per frame at a 16 kHz sampling rate. Note that the size of the frame is a parameter in
most applications, and hence will tend to vary around these values. It should be noted that
since the data is typically analyzed using a discrete Fourier transform (DFT), the frame/
window size is typically adjusted so that the number of samples per frame is a power of two, which maximizes the efficiency of the DFT computation. After framing the signal, a window function is applied to minimize signal discontinuities at the frame edges. There are a number of windowing functions that can be applied, and many authors opt for either a Hamming or a Hann (Hanning) window (Bimbot et al., 2004; Orsag, 2004). With this preprocessing (spectral shaping) completed, the data are ready for the spectral analysis stage, which produces the features required for speech recognition.
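To make these preprocessing steps concrete, the sketch below strings together pre-emphasis, framing, and windowing in NumPy. It is a minimal illustration rather than a reference implementation; the 0.97 pre-emphasis coefficient, the 25 ms frame length, and the 10 ms hop are common but assumed values within the ranges discussed above.

```python
import numpy as np

def preprocess(signal, fs=16000, preemph=0.97, frame_ms=25, hop_ms=10):
    """Pre-emphasize, frame, and window a speech signal.

    Assumed values: 0.97 pre-emphasis coefficient, 25 ms frames
    with centers 10 ms apart, Hamming window.
    """
    # First-order FIR pre-emphasis: y[n] = x[n] - a * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - preemph * signal[:-1])

    # Frame the signal: 25 ms frames with a 10 ms hop, so frames overlap.
    frame_len = int(fs * frame_ms / 1000)   # 400 samples at 16 kHz
    hop_len = int(fs * hop_ms / 1000)       # 160 samples at 16 kHz
    n_frames = 1 + (len(emphasized) - frame_len) // hop_len
    frames = np.stack([emphasized[i * hop_len : i * hop_len + frame_len]
                       for i in range(n_frames)])

    # Apply a Hamming window to each frame to reduce edge discontinuities.
    # For the DFT, frames are often zero-padded to the next power of two.
    return frames * np.hamming(frame_len)

# Example: half a second of noise standing in for a speech recording.
windowed = preprocess(np.random.randn(8000))
print(windowed.shape)    # (n_frames, 400)
```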
2.1.1 Spectral Analysis
There are two principal spectral analysis methods that have been applied with considerable
success to speech data: linear prediction (LP) and cepstral analysis. Both of these approaches
assume that the data consist of a series of approximately stationary segments, which is more or less assured by the framing and windowing preprocessing steps. The basic processing steps in LP analysis are presented in Figure 2.2. In this scheme, the vocal tract is modeled as a digital all-pole filter
(Rabiner & Schafer, 1978; Hermansky, 1990). The signal is modeled as a linear combination
of previous speech samples, according to equation 2.1: