Voice Activity Detection (VAD) in Presence of Noise


About This Presentation

Comparison of VAD algorithms


Slide Content

Voice Activity Detection (VAD) in Presence of Noise
Presented by: Tejus Adiga M, NMAMIT, Nitte.
9 November 2015

Voice Activity Detection (VAD)
Definition of VAD: the task of locating speech-segment boundaries in an input signal corrupted by noise, i.e., classifying each frame as a speech frame or a noise frame.
Problem statement: given an input frame vector x, the VAD problem is to detect the presence of speech in a signal corrupted by different kinds of noise. Assuming that the speech signal and the noise are additive, the VAD module has to decide between the two hypotheses:
H0: x = N
H1: x = S + N    (1)

Artifacts of VAD
- Front End Clipping (FEC): occurs at the transition from noise to speech.
- Mid Speech Clipping (MSC): a speech frame misclassified as noise.
- Over Clipping: occurs at the transition from speech to noise.
- Noise Detected as Speech (NDS): high-energy noise frames detected as speech.

Applications of VAD
- Discontinuous transmission in speech communication systems: encode and transmit only speech frames, and switch off the transmitter during non-speech frames to minimize power consumption. Example: GSM audio codec.
- Automatic speech/speaker recognition: apply recognition algorithms only on speech segments. Examples: Apple's Siri, Microsoft's Cortana, Google Voice.
- Speech encoding: encode speech frames at a high bitrate and non-speech frames at a low bitrate, which increases the compression ratio. Example: ITU G.729 audio codec.
- Speech enhancement and noise reduction systems: non-stationary noise statistics are computed and used for future audio frames.

Literature Survey - Introduction
The speech signal is corrupted additively by environmental noise. The resulting signal is given by x = S + N, where S is the clean speech signal and N is the additive noise. VAD algorithms try to estimate the statistical parameters of the noise and classify each audio frame as speech or noise.

Time Domain Algorithms: VAD using Short-Term Signal Energy
The short-term signal energy of a frame is given by
E = Σ_{n=0}^{N-1} [x(n) w(n)]^2    (2)
where E is the energy of the frame, x(n) is the input audio signal, and w(n) is a window of length N samples.
Training phase: compute the average energy level of the noise and store it as training data.
Detection phase: if the energy of the given frame is greater than the noise energy level, classify the frame as speech; otherwise classify it as noise. If a frame is classified as noise, use it to update the stored noise energy level.
A typical window duration is 20 ms; within a 20 ms window the speech signal appears stationary. For audio sampled at 16 kHz the window length is 320 samples; at 8 kHz it is 160 samples.
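A minimal C++ sketch of this energy rule, assuming a simple multiplicative margin over the stored noise energy and an exponential update when a frame is declared noise; the margin factor and the smoothing constant are illustrative choices, not values from the slides.

#include <cmath>
#include <vector>

// Short-term energy of one frame: E = sum_n (x[n] * w[n])^2.
double frameEnergy(const std::vector<double>& x, const std::vector<double>& w) {
    double e = 0.0;
    for (size_t n = 0; n < x.size(); ++n) {
        double v = x[n] * w[n];
        e += v * v;
    }
    return e;
}

// Energy-based VAD: speech if the frame energy exceeds a scaled noise estimate.
// noiseEnergy is exponentially smoothed whenever a frame is declared noise,
// which plays the role of updating the stored training-data energy level.
bool energyVad(const std::vector<double>& frame, const std::vector<double>& window,
               double& noiseEnergy, double factor = 3.0, double alpha = 0.95) {
    double e = frameEnergy(frame, window);
    bool isSpeech = e > factor * noiseEnergy;
    if (!isSpeech) {
        noiseEnergy = alpha * noiseEnergy + (1.0 - alpha) * e;  // noise model update
    }
    return isSpeech;
}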

Time Domain Algorithms: VAD using Zero Crossing Detector
The number of zero crossings per audio frame is lower for speech than for noise; a typical speech frame of 10 ms shows about 5 to 15 zero crossings. The zero-crossing count is
Z = (1/2) Σ_{n=1}^{N-1} |sgn(x(n)) - sgn(x(n-1))| w(n)    (3)
where x(n) is the input signal frame and w(n) is the window of length N. Consistent with speech having fewer crossings than noise, a frame is classified as speech if its zero-crossing count stays below a threshold.
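A short C++ sketch of the zero-crossing count and decision; the "speech if below threshold" direction follows the observation above that speech frames show fewer crossings, and the threshold value itself is left to the caller.

#include <vector>

// Zero-crossing count of a frame: number of sign changes between consecutive samples.
int zeroCrossings(const std::vector<double>& x) {
    int count = 0;
    for (size_t n = 1; n < x.size(); ++n) {
        if ((x[n] >= 0.0) != (x[n - 1] >= 0.0)) {
            ++count;
        }
    }
    return count;
}

// Speech frames typically show fewer crossings than noise frames, so a frame is
// declared speech when the count falls below a threshold (on the order of 5-15
// crossings per 10 ms frame, per the slide above).
bool zcrVad(const std::vector<double>& frame, int threshold) {
    return zeroCrossings(frame) < threshold;
}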

Frequency Subband Distance Measure (FSDM) Method
Speech frames show a significant power difference between the low-frequency and high-frequency subbands, whereas non-speech frames have a relatively uniform power distribution. The FSDM metric measures the distance between the powers of the low- and high-frequency subbands of the input audio frame x(n), where N is the length of the frame. The FSDM feature can be improved by weighting it with the power envelope of the frame.
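A sketch of the idea behind the metric, under stated assumptions: it compares the spectral power of a low band against a high band of the frame, here via a naive DFT and a log power ratio. The band edges and the log-ratio form are illustrative assumptions, not the exact FSDM expression.

#include <cmath>
#include <complex>
#include <vector>

// Power of frame x in DFT bins [kLo, kHi), computed with a naive DFT (fine for a sketch).
double subbandPower(const std::vector<double>& x, int kLo, int kHi) {
    const double kPi = 3.14159265358979323846;
    const int N = static_cast<int>(x.size());
    double power = 0.0;
    for (int k = kLo; k < kHi; ++k) {
        std::complex<double> X(0.0, 0.0);
        for (int n = 0; n < N; ++n) {
            double phase = -2.0 * kPi * k * n / N;
            X += x[n] * std::complex<double>(std::cos(phase), std::sin(phase));
        }
        power += std::norm(X);  // |X(k)|^2
    }
    return power;
}

// Illustrative subband-distance feature: log power ratio of a low band vs. a high band.
// For speech frames this ratio is large in magnitude; for spectrally flat noise it is near zero.
double subbandDistance(const std::vector<double>& frame, double fs) {
    const int N = static_cast<int>(frame.size());
    int kLow  = static_cast<int>(1000.0 * N / fs);   // ~0-1 kHz band (assumed edge)
    int kHigh = static_cast<int>(4000.0 * N / fs);   // ~1-4 kHz band (assumed edge)
    double pLow  = subbandPower(frame, 1, kLow);
    double pHigh = subbandPower(frame, kLow, kHigh);
    return std::log10((pLow + 1e-12) / (pHigh + 1e-12));
}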

Frequency Subband Distance Measure (FSDM) Method
The FSDM coefficients are smoothed with a median filter to obtain a more reliable decision. The smoothed values over the most recent N_f frames are sorted in ascending order, and the adaptive threshold is set from the sorted sequence: the value at a fixed sorted index, scaled by a constant.
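A sketch of the smoothing and adaptive-threshold step, assuming a standard running median filter and a threshold taken at a fixed fractional position in the sorted values; the fraction and the scale constant are placeholders.

#include <algorithm>
#include <vector>

// Median-filter smoothing of the FSDM values (window of odd length L).
std::vector<double> medianFilter(const std::vector<double>& v, int L) {
    std::vector<double> out(v.size());
    int half = L / 2;
    for (int i = 0; i < static_cast<int>(v.size()); ++i) {
        int lo = std::max(0, i - half);
        int hi = std::min(static_cast<int>(v.size()), i + half + 1);
        std::vector<double> win(v.begin() + lo, v.begin() + hi);
        std::nth_element(win.begin(), win.begin() + win.size() / 2, win.end());
        out[i] = win[win.size() / 2];
    }
    return out;
}

// Adaptive threshold from the last Nf smoothed values: sort ascending and take the
// value at a fixed fractional index, scaled by a constant. Both parameters here are
// placeholders for the constants used in the original method.
double adaptiveThreshold(std::vector<double> recent, double scale = 1.5, double frac = 0.2) {
    if (recent.empty()) return 0.0;
    std::sort(recent.begin(), recent.end());
    size_t idx = static_cast<size_t>(frac * (recent.size() - 1));
    return scale * recent[idx];
}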

Long-Term Spectral Flatness Measure (LSFM) Method
The LSFM metric is computed over the discrete frequencies from 500 Hz to 4000 Hz as
L(m) = Σ_k log10( GM(m, k) / AM(m, k) )    (8)
where GM(m, k) is the geometric mean and AM(m, k) is the arithmetic mean of the power spectral density (PSD) S(n, k) over the last R frames:
GM(m, k) = ( Π_{n=m-R+1}^{m} S(n, k) )^{1/R}    (9)
AM(m, k) = (1/R) Σ_{n=m-R+1}^{m} S(n, k)    (10)
where R is the number of frames used to compute the LSFM metric.
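A C++ sketch of the per-frame LSFM value over a buffer holding the PSDs of the last R frames, assuming the PSD vectors passed in already cover only the 500-4000 Hz bins; the small floor added before the logarithm is an implementation detail, not from the slides.

#include <cmath>
#include <deque>
#include <vector>

// psdHistory holds the PSD of the last R frames; each entry is S(k) over the bins
// covering 500-4000 Hz. The LSFM sums, over those bins, the log-ratio of the
// geometric mean to the arithmetic mean of S(k) across the R frames.
double lsfm(const std::deque<std::vector<double>>& psdHistory) {
    if (psdHistory.empty()) return 0.0;
    const size_t R = psdHistory.size();
    const size_t K = psdHistory.front().size();
    double metric = 0.0;
    for (size_t k = 0; k < K; ++k) {
        double logSum = 0.0, sum = 0.0;
        for (size_t r = 0; r < R; ++r) {
            double s = psdHistory[r][k] + 1e-12;  // avoid log(0)
            logSum += std::log(s);
            sum += s;
        }
        double gm = std::exp(logSum / R);   // geometric mean over R frames
        double am = sum / R;                // arithmetic mean over R frames
        metric += std::log10(gm / am);      // <= 0; closer to 0 means a flatter spectrum
    }
    return metric;
}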

Long-Term Spectral Flatness Measure (LSFM) Method
S(n, k) is the power spectral density of the input, given by
S(n, k) = |X(n, k)|^2    (11)
where X(n, k) is the FFT of the n-th input frame. The LSFM is computed over R frames of the input signal; over R frames it appears relatively flat for speech frames, whereas for non-speech frames averaged over R frames it shows significant peaks.
Fig 1: LSFM feature for different values of R and m.

Single Frequency Filtering (SFF)
Training phase: the noise provides a floor for the speech at discrete frequencies from 300 Hz to 3600 Hz, at intervals of 20 Hz. Floor weights are computed at each frequency from the mean of the noise frames. A further quantity is then computed from the variance and the mean of the noise, with M = 64.

Single Frequency Filtering (SFF)
From this quantity a threshold is found. The dynamic range of the signal is computed for every frame of 300 ms with a 10 ms shift. Depending on the value of the dynamic range, the feature is smoothed over a chosen smoothing length, and the VAD decision is made by comparing the smoothed feature against the threshold.
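A sketch of the dynamic-range computation, assuming one feature value per 10 ms (so a 300 ms block corresponds to 30 values) and assuming the feature is the frame-level SFF quantity described above; both are assumptions about details not spelled out on the slide.

#include <algorithm>
#include <vector>

// Dynamic range of a feature trace: max - min over a 300 ms analysis block,
// evaluated every 10 ms. block and hop are in feature samples (one value per 10 ms).
std::vector<double> dynamicRange(const std::vector<double>& feature,
                                 size_t block = 30, size_t hop = 1) {
    std::vector<double> out;
    for (size_t start = 0; start + block <= feature.size(); start += hop) {
        auto first = feature.begin() + start;
        auto last = first + block;
        double mx = *std::max_element(first, last);
        double mn = *std::min_element(first, last);
        out.push_back(mx - mn);  // a large dynamic range suggests speech activity
    }
    return out;
}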

Single Frequency Filtering (SFF)
The input signal s(n) is differenced, x(n) = s(n) - s(n-1), and x(n) is multiplied by a complex sinusoid of normalized frequency ω̃_k:
x_k(n) = x(n) e^{j ω̃_k n}
where ω̃_k = π - 2π f_k / f_s and f_s is the sampling frequency. The signal is passed through a single-pole filter whose system function is
H(z) = 1 / (1 + r z^{-1})
where r is chosen slightly less than 1 to ensure that the filter is stable. The output of the single-pole filter is
y_k(n) = -r y_k(n-1) + x_k(n)
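A C++ sketch of the single-frequency filtering chain at one frequency, matching the expressions above: difference the signal, shift it with the complex sinusoid, run the single-pole filter, and take the magnitude as the envelope. The value r = 0.995 is an assumed default taken from the SFF literature, not a value stated on the slide.

#include <cmath>
#include <complex>
#include <vector>

// Single Frequency Filtering at one frequency fk (Hz): returns the envelope v_k(n).
// r is kept slightly below 1 so the pole stays inside the unit circle (stability).
std::vector<double> sffEnvelope(const std::vector<double>& s, double fk, double fs,
                                double r = 0.995) {
    const double kPi = 3.14159265358979323846;
    const double wk = 2.0 * kPi * fk / fs;
    const double wTilde = kPi - wk;  // shifted normalized frequency, as above

    std::vector<double> env(s.size(), 0.0);
    std::complex<double> y(0.0, 0.0);
    double prev = 0.0;
    for (size_t n = 0; n < s.size(); ++n) {
        double x = s[n] - prev;                 // first difference of the input
        prev = s[n];
        std::complex<double> xt =
            x * std::polar(1.0, wTilde * static_cast<double>(n));  // multiply by e^{j w~ n}
        y = -r * y + xt;                        // single-pole filter: H(z) = 1 / (1 + r z^-1)
        env[n] = std::abs(y);                   // envelope = sqrt(re^2 + im^2)
    }
    return env;
}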

Single Frequency Filtering (SFF)
From y_k(n) the envelope is obtained at every frequency f_k from 300 Hz to 3600 Hz:
v_k(n) = sqrt( y_kr(n)^2 + y_ki(n)^2 )
where y_kr(n) and y_ki(n) are the real and imaginary components of y_k(n). The envelope is multiplied by the noise floor weights to reduce the effect of noise. Depending on the value of the resulting feature, it is smoothed over a chosen smoothing length, and the VAD decision is made by comparing the smoothed feature against the threshold.

Single Frequency Filtering (SFF)
Fig 2: Visualization of the SFF approach. (a) Clean speech. (b) Speech degraded by noise. (c) Envelope of the degraded signal. (d) Weighted envelopes. (e) Envelope of the clean speech signal.

Simulation of Single Frequency Filtering
Generation of the test vector for out-of-band noise (speech + noise):
- Generate white noise in an audio editing tool (e.g., Audacity).
- Apply a high-pass filter to the white noise with a passband above 4 kHz.
- Mix the clean speech and the out-of-band noise.
- Noise and speech are sampled at 16 kHz, 1 channel, 32-bit floating-point samples.
Fig 3: Spectrum of the generated white noise from Audacity and of the out-of-band noise.
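The noise generation and high-pass filtering happen in Audacity per the steps above; the sketch below covers only the remaining mixing step, with the noise gain as an arbitrary illustrative parameter and file I/O omitted (the sample buffers are assumed to be already loaded).

#include <algorithm>
#include <vector>

// Mix clean speech with (already high-pass filtered) out-of-band noise.
// Both are assumed mono, 32-bit float samples at 16 kHz, as on the slide.
// noiseGain controls how strong the added noise is relative to the speech.
std::vector<float> mixSpeechAndNoise(const std::vector<float>& speech,
                                     const std::vector<float>& noise,
                                     float noiseGain = 0.5f) {
    const size_t n = std::min(speech.size(), noise.size());
    std::vector<float> mixed(n);
    for (size_t i = 0; i < n; ++i) {
        float v = speech[i] + noiseGain * noise[i];
        mixed[i] = std::max(-1.0f, std::min(1.0f, v));  // clamp to the [-1, 1] float range
    }
    return mixed;
}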

Simulation of Single Frequency Filtering
Simulation environment: C++ with C++14 support, the Standard Template Library (STL) and OpenMP; Microsoft Visual Studio 2013 Professional on Windows or Xcode 6.4 on Mac; Audacity and WavePad audio editors.
Training phase: noise frames of 30 seconds total duration are used to generate the floor weights and the initial threshold.
Detection phase: input frames of 20 ms duration (320 samples) are passed through the SFF VAD. If a frame is declared noise, it is used to update the noise floor weights and the threshold. A skeleton of this loop is sketched below.
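A minimal skeleton of the detection-phase flow described above; the feature and noise-update routines are supplied by the caller as placeholders, since the actual SFF feature computation is covered on the earlier slides.

#include <functional>
#include <vector>

// Detection-phase skeleton: each 20 ms frame (320 samples at 16 kHz) is scored
// against a threshold; frames declared noise are fed back so the caller-supplied
// update routine can refresh the noise model and threshold.
std::vector<bool> detectionLoop(
    const std::vector<std::vector<float>>& frames,
    const std::function<double(const std::vector<float>&)>& feature,
    const std::function<void(const std::vector<float>&, double&)>& updateOnNoise,
    double threshold) {
    std::vector<bool> decisions;
    decisions.reserve(frames.size());
    for (const auto& frame : frames) {
        bool isSpeech = feature(frame) > threshold;
        if (!isSpeech) {
            updateOnNoise(frame, threshold);  // adapt the threshold using the noise frame
        }
        decisions.push_back(isSpeech);
    }
    return decisions;
}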

Simulation Results
(a) Clean speech. (b) Out-of-band noise. (c) Speech + noise. (d) Output of the SFF VAD. (e) Clipped signal guided by the VAD output.

Conclusion
- SFF VAD is effective for out-of-band noise, although some speech frames are still detected as noise.
- The threshold in SFF VAD is very sensitive, and transient noise may bias future VAD decisions.
- The SFF approach already operates in the frequency domain, so SFF VAD is easy to integrate as a sub-module in speech systems where frames are already transformed into the frequency domain.
- The computational complexity is significant.

Future Work
- Analysis of SFF VAD for other types of noise.
- SFF VAD is good at avoiding VAD artifacts, but there is still scope for improvement.
- Utilize past VAD decisions to make the detector robust to transient noise.
- Utilize a noise-reduction front end to improve VAD decisions.
- In low-SNR environments, utilize the autocorrelation of the input to improve the VAD decision.

References
[1] Jongseo Sohn, Nam Soo Kim and Wonyong Sung, "A Statistical Model-Based Voice Activity Detection", IEEE Signal Processing Letters, Vol. 6, No. 1, pp. 1-4, January 1999.
[2] G. Aneeja and B. Yegnanarayana, "Single Frequency Filtering Approach for Discriminating Speech and Nonspeech", IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 23, No. 4, pp. 705-717, April 2015.
[3] Oren Rosen, Saman Mousazadeh and Israel Cohen, "Voice Activity Detection in Presence of Transient Noise Using Spectral Clustering and Diffusion Kernels", Proceedings of the IEEE Convention of Electrical and Electronics Engineers in Israel (IEEEI), 2014.
[4] Srikanth Nagisetty, Zongxian Liu, Takuya Kawashima, Hiroyuki Ehara, Xuan Zhou, Bin Wang, Zexin Liu, Lei Miao, Jon Gibbs, Lasse Laaksonen, Venkatraman Atti, Vivek Rajendran, Venkatesh Krishnan, Hosang Sung and Kihyun Choo, "Low Bit Rate High-Quality MDCT Audio Coding of the 3GPP EVS Standard", Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp. 5883-5887, 2015.
[5] Sreekumar K. T., Kuruvachan K. George, Arunraj K. and C. Santhosh Kumar, "Spectral Matching Based Voice Activity Detector for Improved Speaker Recognition", Proceedings of the International Conference on Power, Signals, Controls and Computation, January 2014.
[6] Wei Shi, Yuexian Zou and Yi Liu, "Long-Term Auto-Correlation Statistics Based Voice Activity Detection for Strong Noisy Speech", Proceedings of the IEEE China Summit and International Conference on Signal and Information Processing, pp. 100-104, 2014.
[7] Chong Feng and Chunhui Zhao, "Voice Activity Detection Based on Ensemble Empirical Mode Decomposition and Teager Kurtosis", Proceedings of the International Conference on Signal Processing, pp. 455-460, 2014.

[8] M. H. Moattar and M. M. Homayounpour, "A Simple but Efficient Real-Time Voice Activity Detection Algorithm", Proceedings of the European Signal Processing Conference, pp. 2549-2553, 2009.
[9] Hongzhi Wang, Yuchao Xu and Meijing Li, "Study on the MFCC Similarity-Based Voice Activity Detection Algorithm", Proceedings of Artificial Intelligence, Management Science and Electronic Commerce, pp. 4391-4394, August 2011.
[10] Tuan V. Pham and Gernot Kubin, "Comparison between DFT- and DWT-Based Speech/Non-speech Detection for Adverse Environments", Proceedings of the International Conference on Advanced Technologies for Communications (ATC), pp. 299-302, 2011.
[11] Yanna Ma and Akinori Nishihara, "Efficient Voice Activity Detection Algorithm Using Long-Term Spectral Flatness Measure", European Association for Signal Processing (EURASIP) Journal on Audio, Speech, and Music Processing, No. 1, pp. 1-18, 2013.
[12] Michael Grimm and Kristian Kroschel, "Robust Speech Recognition and Understanding", I-Tech Education and Publishing, June 2007.
[13] Lawrence Rabiner and Ronald W. Schafer, "Digital Processing of Speech Signals", Pearson, Fourth Edition, January 2007.