Speech Signal Processing


Speech Signal Processing, by Murtadha Al-Sabbagh

Speech processing is the study of speech signals and of the methods used to process them. The signals are usually handled in a digital representation, so speech processing can be regarded as a special case of digital signal processing applied to the speech signal. Aspects of speech processing include the acquisition, manipulation, storage, transfer and output of speech signals. Speech processing is generally divided into: 1. recognition (discussed here) and 2. synthesis (not discussed here).

Disciplines related to speech processing:
1. Signal processing: extracting information from speech in an efficient manner.
2. Physics: the science of understanding the relationship between the speech signal and the physiological mechanisms that produce it.
3. Pattern recognition: the set of algorithms used to create patterns and to match data against them according to their degree of similarity.
4. Computer science: designing efficient algorithms for implementing speech recognition methods in hardware or software.
5. Linguistics: the relationship between sounds and words in a language, the meaning of those words, and the overall meaning of sentences.

Speech (phonemes): Sentences consist of words, which consist of phonemes. A phoneme is the basic unit of a language's spoken sounds.

Speech waveform characteristics:
- Loudness
- Voiced/unvoiced: voiced sounds have vibrating vocal cords (periodic); unvoiced sounds have non-vibrating vocal cords (aperiodic)
- Pitch
- Spectral envelope, including the formants: the spectral peaks of the sound spectrum

Aspects of speech processing covered here: pre-processing, feature extraction (analysis), and recognition.

Pre-processing

Pre-processing: After the speech signal has been received, it can be treated (pre-processed) as an analog signal in three general ways: 1. in the time domain (speech wave), 2. in the frequency domain (spectral envelope), or 3. as a combination of both (the spectrogram). (Figure: spectral envelope, energy vs. frequency, with marks at 1 kHz and 2 kHz.)

Time domain: Speech is captured by a microphone and sampled periodically (e.g. at 16 kHz) by an analogue-to-digital converter (ADC). Each converted sample is 16-bit data. If the sampling rate is too low, the signal cannot be captured correctly (Nyquist theorem).

Example: A sound is sampled at 22 kHz with 16-bit resolution. How many bytes are needed to store the sound wave for 10 seconds? Answer: one second has 22K samples, so for 10 seconds: 22K samples x 2 bytes x 10 seconds = 440K bytes. (Note: 2 bytes per sample because 16 bits = 2 bytes.)

Time framing: Since our ear cannot respond to very fast changes in speech content, we normally cut the speech data into frames before analysis (similar to watching fast-changing still pictures to perceive motion). The frame size is 10-30 ms. Frames can be overlapped; the overlapping region normally ranges from 0 to 75% of the frame size.

Time framing, continued: For a 22 kHz / 16-bit sampled speech wave, the frame size is 15 ms and the frame overlap is 40% of the frame size. Draw the frame block diagram. Answer: number of samples in one frame N = 15 ms / (1/22k) = 330; overlapping samples = 132, so the frame shift m = N - 132 = 198; overlapping time x = 132 x (1/22k) = 6 ms; time in one frame = 330 x (1/22k) = 15 ms. (Diagram: window i = 1 of length N, window i = 2 starting m samples later, overlapping the first window by x in time.)
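
As a rough illustration of the frame arithmetic above, here is a minimal Python/NumPy sketch (the function name split_into_frames and the 22 kHz / 15 ms / 40% values are taken from the example; treat it as illustrative, not as the presentation's own code):

```python
import numpy as np

def split_into_frames(signal, fs, frame_ms=15, overlap=0.4):
    """Cut a 1-D signal into overlapping frames (sketch of the example above)."""
    frame_len = int(round(fs * frame_ms / 1000.0))        # N = 330 samples at 22 kHz
    hop = frame_len - int(round(overlap * frame_len))     # m = N - overlapping samples = 198
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

# Example: one second of a dummy signal at 22 kHz
frames = split_into_frames(np.random.randn(22000), fs=22000)
print(frames.shape)   # about (110, 330): ~110 frames of N = 330 samples, hop m = 198
```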

The frequency domain: Use the DFT or FFT to transform the wave from the time domain to the frequency domain (i.e. to the spectral envelope). The magnitude of each frequency component is |X_m| = (real^2 + imaginary^2)^0.5.
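
A minimal NumPy sketch of this step, assuming a real-valued frame x (np.abs of the FFT output gives exactly sqrt(real^2 + imaginary^2)); the 1 kHz test tone is just made-up input:

```python
import numpy as np

fs = 22050                               # assumed sampling rate
t = np.arange(1024) / fs
x = np.sin(2 * np.pi * 1000 * t)         # a 1 kHz test tone as the frame

X = np.fft.rfft(x)                       # complex spectrum of the frame
magnitude = np.abs(X)                    # |X_m| = sqrt(real^2 + imaginary^2)
freqs = np.fft.rfftfreq(len(x), d=1.0/fs)
print(freqs[np.argmax(magnitude)])       # peak near 1000 Hz
```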

The frequency domain, continued. (Figure: spectral envelope, energy vs. frequency, with peaks around 1 kHz and 2 kHz.)

The spectrogram: shows the spectral envelope as time moves forward. In the spectrogram, the white bands are the formants, which represent the high-energy frequency content of the speech signal.

Feature Extraction (Analysis)

Feature extraction techniques

(A) Filtering. Ways to find the spectral envelope: filter banks (uniform or non-uniform), and LPC and cepstral/LPC parameters. (Figure: the outputs of filter 1, filter 2, filter 3, ... trace out the spectral envelope energy.)

Spectral envelopes of two sounds. Speech recognition idea using 4 linear filters, each with a bandwidth of 2.5 kHz (covering 0-10 kHz): two sounds give two different spectral envelopes, e.g. SE_ar for "ar" and SE_ei for "ei". Passing each sound through the filter bank gives four filter output energies: {v1, v2, v3, v4} for spectrum A ("ar") and {w1, w2, w3, w4} for spectrum B ("ei").

Difference between two sounds (two spectral envelopes SE and SE'). E.g. SE_ar = {v1, v2, v3, v4} = "ar" and SE_ei = {w1, w2, w3, w4} = "ei". A simple measure of the difference is Dist = sqrt(|v1-w1|^2 + |v2-w2|^2 + |v3-w3|^2 + |v4-w4|^2), where |x| is the magnitude of x.
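
A tiny sketch of this distance computation; the filter-output values v and w below are made-up numbers, used only to illustrate the formula:

```python
import numpy as np

v = np.array([5.0, 9.0, 2.0, 1.0])    # hypothetical filter-bank energies for "ar"
w = np.array([2.0, 3.0, 8.0, 4.0])    # hypothetical filter-bank energies for "ei"

dist = np.sqrt(np.sum((v - w) ** 2))  # Dist = sqrt(sum |v_i - w_i|^2)
print(dist)                           # Euclidean distance between the two envelopes
```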

(B) Linear predictive coding (LPC). The concept is to find a set of parameters a_1, a_2, ..., a_p to represent the same waveform (typical values of p are 8 to 13). For example, with 30 ms frames of 512 samples each (s_0, s_1, ..., s_511), each frame of 512 16-bit integers is compressed into a set of 8 floating-point numbers: frame y gives a_1, ..., a_8; frame y+1 gives a'_1, ..., a'_8; frame y+2 gives a''_1, ..., a''_8; and so on. The waveform can be reconstructed from these LPC codes.

Example: A speech waveform S has the values s0, s1, ..., s8 = [1, 3, 2, 1, 4, 1, 2, 4, 3]. The frame size is 4. Find the auto-correlation parameters r0, r1, r2 for the first frame. If we use LPC order 2 for our feature extraction system, find the LPC coefficients a1, a2.

Answer: The frame size is 4, so the first frame is [1, 3, 2, 1]. r0 = 1x1 + 3x3 + 2x2 + 1x1 = 15; r1 = 3x1 + 2x3 + 1x2 = 11; r2 = 2x1 + 1x3 = 5. The LPC coefficients a1, a2 then follow from the order-2 normal equations built from r0, r1, r2, as sketched below.
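
A minimal NumPy sketch of that last step (this assumes the usual autocorrelation / normal-equation formulation of LPC, and uses np.linalg.solve instead of the Levinson-Durbin recursion for brevity):

```python
import numpy as np

frame = np.array([1.0, 3.0, 2.0, 1.0])          # first frame of the example

# Auto-correlation r(k) = sum over n of frame[n] * frame[n+k]
r = np.array([np.dot(frame[:len(frame)-k], frame[k:]) for k in range(3)])
print(r)                                         # [15. 11.  5.]

# Order-2 LPC: solve the normal equations  [[r0 r1],[r1 r0]] [a1 a2]^T = [r1 r2]^T
R = np.array([[r[0], r[1]],
              [r[1], r[0]]])
a = np.linalg.solve(R, r[1:])
print(a)                                         # a1 ~ 1.058, a2 ~ -0.442
```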

(C) Cepstrum. The word was coined by reversing the first four letters of "spectrum". The cepstrum is the spectrum of the (log) spectrum of a signal.

Glottis and cepstrum. The speech wave (S) is modelled as the glottal excitation (E), produced by the vocal cords (glottis), passed through the vocal tract filter (H): S = E . H. A voiced sound therefore has strong glottal-excitation frequency content, and in the cepstrum we can easily identify and remove this glottal excitation.

Cepstral analysis. The signal s is the convolution (*) of the glottal excitation e and the vocal tract filter h: s(n) = e(n) * h(n), where n is the time index. After the Fourier transform, FT{s(n)} = FT{e(n) * h(n)}: convolution (*) becomes multiplication (.), and n (time) becomes w (frequency), so S(w) = E(w) . H(w). Taking the magnitude of the spectrum, |S(w)| = |E(w)| . |H(w)|, and then the logarithm: log10|S(w)| = log10{|E(w)|} + log10{|H(w)|}. Ref: http://iitg.vlab.co.in/?sub=59&brch=164&sim=615&cnt=1

Cepstrum: C(n) = IDFT[log10|S(w)|] = IDFT[log10{|E(w)|} + log10{|H(w)|}]. In c(n), the contributions of E and H can be seen at two different positions. Application: useful for (i) glottal excitation analysis and (ii) vocal tract filter analysis. (Block diagram: s(n) -> windowing -> x(n) -> DFT -> X(w) -> log|X(w)| -> IDFT -> C(n), where n is the time index, w the frequency, and IDFT the inverse discrete Fourier transform.)
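
A minimal sketch of that pipeline (windowing, DFT, log magnitude, inverse DFT) in NumPy; the two-tone test frame and the liftering cut-off value are made up for illustration:

```python
import numpy as np

fs = 22050
n = np.arange(512)
s = np.sin(2 * np.pi * 200 * n / fs) + 0.3 * np.sin(2 * np.pi * 1800 * n / fs)  # toy frame

x = s * np.hamming(len(s))                 # windowing to suppress the frame edges
X = np.fft.fft(x)                          # DFT
log_mag = np.log10(np.abs(X) + 1e-10)      # log10 |X(w)|  (epsilon avoids log(0))
c = np.fft.ifft(log_mag).real              # C(n) = IDFT[ log10 |X(w)| ]  (real cepstrum)

# Low-time liftering: keep only the low-quefrency part (vocal tract component)
cutoff = 30                                # cut-off index, found by experiment
vocal_tract_cepstrum = c[:cutoff]
print(vocal_tract_cepstrum[:5])
```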

(Figure: s(n) is the time-domain signal; x(n) = windowed(s(n)) suppresses the two ends of the frame; taking |X(w)|, then log|X(w)|, then C(n) = IDFT(log|X(w)|) gives the cepstrum, in which the vocal tract cepstrum sits at low quefrency and the glottal excitation cepstrum at high quefrency.)

Liftering (to remove the glottal excitation). Low-time liftering: magnify (or inspect) the low-quefrency part to find the vocal tract filter cepstrum, which is used for speech recognition. High-time liftering: magnify (or inspect) the high-quefrency part to find the glottal excitation cepstrum, which is removed for speech recognition. The corresponding frequency is FS / quefrency, where FS is the sampling frequency (here 22050 Hz). The cut-off between the two regions is found by experiment.

Reasons for liftering. Why do we need this? Answer: to remove the ripples of the spectrum caused by the glottal excitation. The spectrum of the input speech signal (obtained by the Fourier transform) has many ripples caused by the vocal cord vibrations (glottal excitation), but for recognition and reproduction we are more interested in the smooth speech envelope. Ref: http://isdl.ee.washington.edu/people/stevenschimmel/sphsc503/files/notes10.pdf

Speech Recognition

Speech recognition (SR) is the translation of spoken words into text. It is also known as "automatic speech recognition" (ASR), "computer speech recognition", or simply "speech to text" (STT).

Speech recognition procedure. We will now combine all the methods covered so far, to connect the dots and clarify the recognition system. Note that only step 4 is the recognition process proper; the other steps correspond to the parts covered before. Steps: (1) end-point detection; (2a) frame blocking and (2b) windowing; (3) feature extraction: find cepstral coefficients via LPC (auto-correlation analysis, LPC analysis, LPC-to-cepstral conversion); (4) matching: distortion measure calculation.

Step 1: Get one frame and perform end-point detection, to determine the start and end points of the speech sound. This is not always easy, since the energy at the start of the utterance is often low. The end points are determined from the energy and the zero-crossing rate. (Figure: recorded signal s(n) with the detected end points; in our example the detected speech is about 1 second long.)

Step 2(a): Frame blocking. Choose the frame size (N samples), with adjacent frames separated by m samples. E.g. for a 16 kHz sampling rate, a 10 ms window has N = 160 samples and m = 40 samples. (Figure: window l = 1 of length N and window l = 2 of length N, offset by m samples.)

Step 2(b): Windowing. To smooth out the discontinuities at the beginning and end of each frame, a Hamming or Hanning window can be used. Tutorial: write a program segment to find the result of passing a speech frame, stored in an array int s[1000], through the Hamming window.
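
One possible answer, sketched in Python/NumPy rather than with the C-style int s[1000] array mentioned in the tutorial; the window used is the common Hamming form w(n) = 0.54 - 0.46 cos(2 pi n / (N-1)), and the random frame is only a stand-in for real speech samples:

```python
import numpy as np

N = 1000
s = np.random.randint(-32768, 32768, size=N)       # a speech frame of 16-bit samples

n = np.arange(N)
w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))  # Hamming window (same as np.hamming(N))
windowed = s * w                                    # sample-by-sample multiplication

print(windowed[:5])   # edge samples are scaled close to 0, the centre is almost unchanged
```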

Effect of Hamming window

Step 3.1: Auto-correlation analysis. The auto-correlation of every frame (l = 1, 2, ...) of the windowed signal is calculated. If the required output is a p-th order LPC, the auto-correlation for the l-th frame is r_l(k) = sum over n of x_l(n) x_l(n+k), for k = 0, 1, ..., p.

Step 3.2: LPC calculation. The LPC coefficient vector a = [a_1, ..., a_p] is calculated by solving the normal equations built from the auto-correlation values, R a = r, where R is the p x p Toeplitz matrix of r(0), ..., r(p-1) and r = [r(1), ..., r(p)]; in practice this is done efficiently with the Levinson-Durbin recursion.
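
A sketch of the Levinson-Durbin recursion (the function name levinson_durbin and the sign convention, a predictor of the form s[n] ~ a_1 s[n-1] + ... + a_p s[n-p], are my choices; the loop follows the textbook recursion and is checked against the earlier worked example):

```python
import numpy as np

def levinson_durbin(r, p):
    """Solve the order-p LPC normal equations from autocorrelations r[0..p]."""
    a = np.zeros(p + 1)          # a[0] is unused; a[1..p] are the LPC coefficients
    e = r[0]                     # prediction error energy
    for m in range(1, p + 1):
        k = (r[m] - np.dot(a[1:m], r[m-1:0:-1])) / e     # reflection coefficient
        a_new = a.copy()
        a_new[m] = k
        a_new[1:m] = a[1:m] - k * a[m-1:0:-1]            # update previous coefficients
        a, e = a_new, e * (1.0 - k * k)
    return a[1:], e

# Check against the earlier worked example: r = [15, 11, 5], order 2
a, err = levinson_durbin(np.array([15.0, 11.0, 5.0]), 2)
print(a)        # approximately [ 1.058, -0.442 ]
```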

Step 3.3: LPC to cepstral coefficient conversion. Cepstral coefficients are more accurate in describing the characteristics of the speech signal. Normally, cepstral coefficients of order 1 <= m <= p are enough to describe the speech signal. Calculate c_1, c_2, ..., c_p from the LPC coefficients a_1, a_2, ..., a_p.
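
A sketch of this conversion, using one common form of the LPC-to-cepstrum recursion, c_m = a_m + sum over k = 1..m-1 of (k/m) c_k a_(m-k); sign and gain conventions differ slightly between textbooks, so treat it as illustrative:

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps=None):
    """Convert LPC coefficients a[0..p-1] (= a_1..a_p) to cepstral coefficients."""
    p = len(a)
    n_ceps = n_ceps or p
    c = np.zeros(n_ceps)
    for m in range(1, n_ceps + 1):
        acc = a[m - 1] if m <= p else 0.0        # the a_m term (zero beyond order p)
        for k in range(1, m):
            if m - k <= p:
                acc += (k / m) * c[k - 1] * a[m - k - 1]
        c[m - 1] = acc
    return c

# Example with the order-2 LPC coefficients from the earlier example
print(lpc_to_cepstrum(np.array([1.058, -0.442])))
```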

Step 4: Matching method: dynamic programming (DP). Correlation is a simple method for pattern matching, BUT the most difficult problem in speech recognition is time alignment: no two speech sounds are exactly the same, even when produced by the same person. We therefore align the speech features with an elastic matching method, DP (dynamic time warping).

(B) Dynamic programming algorithm. Step 1: calculate the distortion matrix dist(i, j) between every pair of reference and input frames. Step 2: calculate the accumulated matrix using D(i, j) = dist(i, j) + min{ D(i-1, j), D(i-1, j-1), D(i, j-1) }.

To find the optimal path in the accumulated matrix (and the minimum accumulated distortion/distance): Starting from the top row and the right-most column, find the lowest cost D(i, j)_t: here it is found to be the cell at (i, j) = (3, 5), D(3, 5) = 7, in the top row. (This cost is called the "minimum accumulated distance", or "minimum accumulated distortion".) From the lowest-cost position (i, j)_t, find the next position (i, j)_(t-1) = argmin_(i,j) { D(i-1, j), D(i-1, j-1), D(i, j-1) }; e.g. (i, j)_(t-1) = argmin_(i,j) {11, 5, 12}, so the cell containing 5 is selected. Repeat until the path reaches the left-most column or the lowest row. Note: argmin_(i,j) {cell1, cell2, cell3} means the position (i, j) of the cell with the lowest value is selected.
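
A compact sketch of this dynamic-programming (DTW) matching; it only builds the distortion and accumulated-distance matrices and returns the final accumulated distance, without the backtracking or path restrictions described here, and the toy templates are made-up one-dimensional features:

```python
import numpy as np

def dtw(ref, inp):
    """Accumulated-distance DTW between two feature sequences (rows = frames)."""
    I, J = len(ref), len(inp)
    dist = np.array([[np.linalg.norm(ref[i] - inp[j]) for j in range(J)]
                     for i in range(I)])                      # distortion matrix
    D = np.full((I, J), np.inf)
    D[0, 0] = dist[0, 0]
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            prev = min(D[i-1, j] if i > 0 else np.inf,
                       D[i-1, j-1] if i > 0 and j > 0 else np.inf,
                       D[i, j-1] if j > 0 else np.inf)
            D[i, j] = dist[i, j] + prev                       # DP recurrence
    return D[I-1, J-1]                                        # minimum accumulated distance

# Toy example: match an unknown input against two 1-D templates
yes, no = np.array([[1.], [3.], [5.]]), np.array([[6.], [2.], [1.]])
unknown = np.array([[1.], [2.], [5.], [5.]])
print(dtw(yes, unknown), dtw(no, unknown))   # smaller value means better match ("yes" here)
```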

Optimal path: it should run from some element in the top row or right-most column to some element in the bottom row or left-most column. The reason is that noise may corrupt elements at the beginning or the end of the input sequence. In actual processing, however, the path should be restrained to lie near the 45-degree diagonal (from bottom left to top right); see the attached diagram, where the path cannot pass through the restricted regions. The user can set these regions manually; this is a way to prohibit unreasonable matches. See the next page.

Optimal path and restricted regions

Example for DP: The cepstrum codes of the speech sounds 'YES' and 'NO' and of an unknown 'input' are shown. Is the 'input' 'YES' or 'NO'?

Answer: Starting from the top row and the right-most column, find the lowest cost D(i, j)_t: it is found to be the cell at (i, j) = (9, 9), D(9, 9) = 13. From the lowest-cost position (i, j)_t, find the next position (i, j)_(t-1) = argmin_(i,j) { D(i-1, j), D(i-1, j-1), D(i, j-1) }; e.g. (i, j)_(t-1) = argmin_(i,j) {48, 12, 47} = (9-1, 9-1) = (8, 8), the cell that contains 12. Repeat until the path reaches the left-most column or the lowest row. Note: argmin_(i,j) {cell1, cell2, cell3} means the position (i, j) of the cell with the lowest value is selected.

Thank you ^_~