Speaker Recognition PPT 14 Aug 2024.pptx

robertkotieno2022 22 views 26 slides Aug 26, 2024

About This Presentation

Speaker identification and recognition.


Slide Content

Speaker Recognition System using Hybrid of MFCC and RCNN with HCO Algorithm Optimization
STEPHEN NYAKUTI, ENG-319-001/2022
Supervisors: Prof. Dr.-Eng. Livingstone Ngoo; Dr. Henry Kiragu
July 2024

1. OUTLINE Introduction; Problem Statement; Research Objectives; Research Questions; Justification/Rationale; Scope of the Research; Literature Review; Methodology; Results; Conclusions; Recommendations; References

2. INTRODUCTION Speaker recognition is the process of identifying a person from their voice characteristics. Voice characteristics specific to speakers can be captured using Mel-frequency cepstral coefficients (MFCCs). MFCC-based systems have been improved with Convolutional Neural Networks (CNNs), yet achieving high accuracy remains difficult. Hybrid algorithms combining MFCC and Region-based Convolutional Neural Networks (R-CNN) have shown promise. Current systems still struggle to correctly recognize speakers in noisy environments such as the National Assembly of Kenya, and the speaker recognition approaches currently used for verbatim transcription in the National Assembly are manual, error-prone and inefficient. The researcher therefore sought to develop a robust speaker recognition method that uses MFCCs as the acoustic features and an R-CNN reinforced with the hosted cuckoo optimization (HCO) algorithm to classify them.

3. PROBLEM STATEMENT Existing speaker recognition methods are inadequate, particularly in noisy environments. Traditional approaches struggle to maintain accuracy and reliability when speech signals are corrupted by noise, and this limitation hampers the effectiveness of speaker recognition systems in many real-world applications. This research aimed to develop a robust method that can identify speakers in both clean and noisy environments by integrating MFCC for feature extraction with an R-CNN classifier optimized using the HCO algorithm.

4. RESEARCH OBJECTIVES GENERAL OBJECTIVE To develop, simulate and validate a speaker recognition method for verbatim transcription using spectrum subtraction and Mel-frequency cepstral coefficients. SPECIFIC OBJECTIVES i. To develop a speaker recognition method that denoises speech signals using spectrum subtraction; ii. To simulate the speaker recognition method for verbatim transcription; iii. To validate the performance of the denoised speaker recognition method for verbatim transcription.

5. RESEARCH QUESTIONS i. How is a speaker recognition method that denoises speech signals using spectrum subtraction developed? ii. How is the speaker recognition method for verbatim transcription simulated? iii. How is the denoised speaker recognition system for verbatim transcription validated?

6. JUSTIFICATION/RATIONALE The research sought to address the challenges of verbatim transcription of the plenary and committee proceedings of the National Assembly of Kenya; it sought to solve the problem of the manual and often inaccurate speaker recognition currently used; and it innovatively combined MFCC with an optimized R-CNN for speaker recognition, thus advancing knowledge in the area.

7. SCOPE OF THE RESEARCH The research proposed a novel speaker recognition method based on MFCC and R-CNN optimized with the HCO algorithm. The scope of the research encompassed the development, simulation and validation of the performance of the proposed method.

8. LITERATURE REVIEW (Author / Method / Finding / Gap)
Ashar et al. (2020): Proposed a hybrid CNN-MFCC approach for speaker identification. Finding: achieved an accuracy of 97.8%; the method outperformed existing GMM- and SVM-based methods. Gap: the researchers did not test their method on noisy or degraded speech signals, and the system was not applied in a real-time scenario such as a parliamentary setting.
Ayvaz et al. (2022): Used MFCC features with machine learning (KNN, DT, RF and ANN). Finding: achieved an accuracy of 98.7% with ANN; the researchers claimed their system was robust to noise and speaker variability. Gap: did not consider parliamentary settings or apply the system to verbatim transcription; also failed to examine scalability and computational complexity to inform its use in other scenarios.
Costantini et al. (2023): Developed high-level CNN and machine learning methods for speaker recognition. Finding: the system was tested on the LibriSpeech dataset and achieved an accuracy of 99.2% with RF; it handled large-scale datasets and diverse speakers. Gap: did not compare the system with other CNN-based methods, and did not evaluate it on verbatim transcription tasks or parliamentary settings.

9. LITERATURE REVIEW CONTD (Author / Method / Finding / Gap)
Moondra and Chahal (2023): Proposed an improved speaker recognition method for degraded human voice using modified MFCC and linear predictive coding (LPC) with CNN. Finding: achieved an accuracy of 98.9% and claimed superiority over conventional MFCC-based methods in noise robustness and speaker discrimination. Gap: provided no theoretical or empirical justification for the modifications to the MFCC algorithm or the choice of LPC features.
Gaurav et al. (2023): Designed an efficient speaker identification framework based on a Mask R-CNN classifier whose parameters were optimized using hosted cuckoo optimization (HCO), a metaheuristic algorithm inspired by the behavior of cuckoo birds. Finding: experiments on the VoxForge dataset achieved an accuracy of 99.4%. Gap: the approach was applied to the identification of birds, not to speakers in parliamentary settings.

10. METHODOLOGY Objective 1: Development of the speaker recognition method. Processing pipeline: speech signal → pre-processing → denoising by spectrum subtraction → MFCC feature extraction → speaker modeling by R-CNN optimized with the HCO algorithm → feature matching with the database → decision (speaker identity known / speaker identity unknown). Materials and Tools needed in this research included: i. Computer; ii. MATLAB software and license; iii. Storage disks to store the audio files; iv. Data bundles for connectivity; v. Statistical analysis tool.

11. METHODOLOGY CONTD Denoising by Spectrum Subtraction. A time-smoothing window was applied to the noisy signal to reduce high-frequency noise and exploit the correlation between successive signal samples. A noise reduction filter was then applied to the smoothed noisy signal to estimate the speech signal of interest. A Wiener filter was used on the noisy signals to minimize the mean square error between the estimated and true signals (Dogariu et al., 2021). A Kalman filter was used on the same noisy signals to estimate the state of a dynamic system from a series of noisy measurements (Hu et al., 2020). The SNR of the estimated speech signal was calculated as shown in Equation 3.2: SNR = 10 log10( Σn s(n)² / Σn (s(n) − ŝ(n))² ), where s is the true speech signal, ŝ is the estimated speech signal, and the sums run over the N samples.
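The thesis implemented this stage in MATLAB; as an illustration only, a minimal Python/NumPy sketch of magnitude spectral subtraction and the SNR formula of Equation 3.2 might look as follows (the single-frame processing and the FFT size are simplifying assumptions, not the thesis implementation):

```python
import numpy as np

def spectral_subtraction(noisy, noise_est, n_fft=256):
    """Subtract an estimated noise magnitude spectrum from the noisy spectrum,
    keeping the noisy phase (illustrative single-frame version)."""
    spec = np.fft.rfft(noisy, n_fft)
    noise_mag = np.abs(np.fft.rfft(noise_est, n_fft))
    mag = np.abs(spec) - noise_mag          # subtract the noise magnitude
    mag = np.maximum(mag, 0.0)              # half-wave rectify negative bins
    clean = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n_fft)
    return clean[: len(noisy)]

def snr_db(s, s_hat):
    """SNR in dB between the true signal s and the estimate s_hat (Eq. 3.2)."""
    return 10.0 * np.log10(np.sum(s ** 2) / np.sum((s - s_hat) ** 2))
```

In practice spectral subtraction is applied frame-by-frame with overlap-add; this sketch shows only the core subtraction step.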

12. METHODOLOGY CONTD A Hamming window was used when extracting pitch coefficients, tapering the signal towards zero at the window boundaries to avoid discontinuities. The time-domain samples x(n) of each frame were converted into the frequency domain using Equation 3.5: X(k) = Σn x(n) e^(−j2πkn/N), n = 0, …, N−1, where X(k) is the FFT of x(n). The Discrete Fourier Transform (DFT) of each frame was computed to obtain the magnitude spectrum (Deshpande et al., 2021). The Mel scale was defined as mf = 2595 log10(1 + f/700), where mf is the Mel frequency and f is the linear frequency in Hz.
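The windowing, FFT and Mel filter bank steps above can be sketched in Python/NumPy as below. This is an illustrative outline, not the thesis's MATLAB code: the sample rate, FFT size, filter count and the number of retained coefficients are assumed values.

```python
import numpy as np

def hz_to_mel(f):
    """Mel scale: mf = 2595 * log10(1 + f/700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters=20, n_fft=512, sr=16000):
    """Triangular filters spaced evenly on the Mel scale."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc_frame(frame, fb):
    """Hamming window -> magnitude spectrum -> Mel energies -> log -> DCT-II."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), 512))
    logmel = np.log(fb @ spec + 1e-10)
    n = len(logmel)
    basis = np.cos(np.pi / n * (np.arange(n) + 0.5)[None, :] * np.arange(n)[:, None])
    return (basis @ logmel)[:13]   # keep the first 13 coefficients
```

A full pipeline would apply `mfcc_frame` to overlapping frames of the pre-emphasized signal.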

13. METHODOLOGY CONTD The method combines the advantages of RNN, CNN and long short-term memory (LSTM) networks as feature extractors for sequential data. For the R-CNN, the output vector y is expressed as shown in Equation 3.8: y = f(Wᵀx + b), where f is the activation function, Wᵀ is the transpose of the weight matrix, x is an input vector, and b is a bias vector. The HCO algorithm was used to optimize the R-CNN. HCO consists of three operators: levy flight, egg laying, and host bird selection (Gaurav et al., 2023). The levy flight is expressed as shown in Equation 3.9: x_i(t+1) = x_i(t) + α · Levy(λ), where x_i(t) is the position of the i-th cuckoo at iteration t, α is a scaling factor, and Levy(λ) is a levy distribution with exponent λ.
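Equation 3.8 is the standard dense-layer mapping; a minimal NumPy illustration (the ReLU activation and the shapes are assumptions for the example, not taken from the thesis) is:

```python
import numpy as np

def dense_layer(x, W, b):
    """y = f(W^T x + b) as in Equation 3.8, with a ReLU activation f."""
    return np.maximum(W.T @ x + b, 0.0)
```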

14. METHODOLOGY CONTD Egg laying is the process of generating new solutions by modifying the current solutions, and can be expressed as shown in Equation 3.10: x_i' = x_i + F (x_r1 − x_r2), where x_i' is the new solution (egg) generated by the i-th cuckoo, x_r1 and x_r2 are randomly selected solutions from the current population, and F is a mutation factor. Host bird selection is the process of selecting the best solutions from the current population and the new solutions: a new solution x' replaces the old solution x whenever f(x') < f(x), where f is the objective function to be minimized.
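The three HCO operators described above (levy flight, Eq. 3.9; egg laying, Eq. 3.10; host bird selection) can be sketched as a generic minimizer. This is a hedged illustration only: the population size, iteration count, scaling factor α, mutation factor F, levy exponent λ and Mantegna's method for levy-distributed steps are all assumptions, not the parameterization used in the thesis.

```python
import numpy as np
from math import gamma

def hco_minimize(f, dim=2, n_cuckoos=15, iters=200, alpha=0.01, F=0.5, lam=1.5, seed=0):
    """Hosted-cuckoo-style search: levy flight, egg laying, greedy selection."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-5.0, 5.0, (n_cuckoos, dim))
    fit = np.array([f(x) for x in pop])
    # Mantegna's method: sigma for levy-distributed step lengths
    sigma = (gamma(1 + lam) * np.sin(np.pi * lam / 2) /
             (gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    for _ in range(iters):
        # Levy flight (Eq. 3.9): x_i(t+1) = x_i(t) + alpha * Levy(lambda)
        step = sigma * rng.standard_normal((n_cuckoos, dim)) / \
               np.abs(rng.standard_normal((n_cuckoos, dim))) ** (1 / lam)
        trial = pop + alpha * step
        # Egg laying (Eq. 3.10): mutate with two randomly chosen solutions
        r1, r2 = rng.integers(0, n_cuckoos, (2, n_cuckoos))
        eggs = pop + F * (pop[r1] - pop[r2])
        # Host bird selection: keep whichever candidate improves the objective
        for cand in (trial, eggs):
            cf = np.array([f(x) for x in cand])
            better = cf < fit
            pop[better], fit[better] = cand[better], cf[better]
    return pop[np.argmin(fit)], fit.min()
```

In the thesis this search optimized R-CNN parameters; here the objective `f` is any function to minimize, e.g. `hco_minimize(lambda z: np.sum(z ** 2))`.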

15. METHODOLOGY CONTD Objective 2: To simulate the speaker recognition method for verbatim transcription. Simulation experiments were performed with 50 audio files. The simulation showed the variation of the EER, TMR and FRR against FMR for each signal and method. The comparison was based on the Equal Error Rate (EER), False Rejection Rate (FRR), False Match Rate (FMR), and True Match Rate (TMR).
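The evaluation metrics named above can be computed from similarity scores of genuine and impostor trials. The following Python sketch is one common way to derive FRR, FMR, TMR and EER from such scores (the threshold-sweep approach and score conventions are assumptions; the thesis does not specify its implementation):

```python
import numpy as np

def match_rates(genuine, impostor, threshold):
    """FRR: genuine trials rejected; FMR: impostor trials accepted; TMR = 1 - FRR."""
    frr = float(np.mean(np.asarray(genuine) < threshold))
    fmr = float(np.mean(np.asarray(impostor) >= threshold))
    return frr, fmr, 1.0 - frr

def equal_error_rate(genuine, impostor):
    """Sweep thresholds over all scores; the EER is taken where FRR and FMR are closest."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    best = min((abs(match_rates(genuine, impostor, t)[0] -
                    match_rates(genuine, impostor, t)[1]), t) for t in thresholds)
    frr, fmr, _ = match_rates(genuine, impostor, best[1])
    return (frr + fmr) / 2.0
```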

16. METHODOLOGY CONTD Objective 3: To validate the denoised speaker recognition method for verbatim transcription. Raw speech data were obtained from recordings of staff of the National Assembly of Kenya; 50 data files of the speeches were used. The speeches were checked against the transcriptions to confirm the speakers. Testing was done using one noise-free and one noisy speech signal with the proposed method. The input signal, pre-emphasized signal and windowed frames were plotted in the time and frequency domains; the Mel filter bank response and the extracted MFCCs were also plotted. Speaker recognition tests were carried out using noise-free signals and three speaker recognition methods: the proposed MFCC with R-CNN optimized with the HCO algorithm (MFCC-RCNN-HCO), MFCC-CNN and LPC-CNN.

17. RESULTS Objective 1: To develop a speaker recognition method that denoises speech signals using spectrum subtraction. The figure depicted the original, noisy, and denoised audio signals. When Gaussian noise at an SNR of 10 dB was introduced, the background noise in the resulting noisy signal reduced the periods of silence visible in the original signal. The denoised signal had a lower amplitude than the noisy signal. An increasing phase shift with frequency implied a progressive, frequency-dependent delay in the signal.

18. RESULTS CONTD This indicated a complex signal characterized by multiple frequency components with a high SNR. Conversely, the noisy signal presented a much narrower power spectrum, ranging from −30 dB to −70 dB, suggesting that the noise within the signal remained substantial even as the Gaussian noise SNR increased from 1 to 20 dB. The Mel frequency filter bank response for the recorded audio signals showed high response intensity at low frequencies and a gradual reduction in intensity with increasing frequency.

19. RESULTS CONTD A filter bank is a collection of filters, each designed to isolate a specific frequency band of the input signal. The Mel scale was defined as mf = 2595 log10(1 + f/700). The Mel filter bank illustrated in the figure shows that as the filter index increased from 1 to 20 with increasing frequency index, the spectrum declined from 1 to 0. This gradual decline across the filters reflects the logarithmic nature of the Mel scale, which is more closely aligned with human auditory perception than a linear frequency scale.

20. RESULTS CONTD The MFCC spectral range was mainly between −3 and 1 for Audio 1, between −1 and 2 for Audio 2, between −4 and 1 for Audio 3, and between −1 and 3 for Audio 4. Audio 1 and Audio 3 had lower ranges, potentially indicating a deeper voice quality, while Audio 2 and Audio 4 exhibited higher ranges corresponding to a lighter or more variable voice.

21. RESULTS CONTD Objective 3: To validate the performance of the denoised speaker recognition method for verbatim transcription. The performance of the three algorithms (MFCC-RCNN-HCO, LPC-CNN, and MFCC-CNN) was tested on 50 audio samples using the EER, FRR, FMR and TMR indicators. The proposed MFCC-RCNN-HCO algorithm exhibited robust performance across these metrics. FRR vs. FMR: the proposed algorithm achieved the lowest FRR; as FRR decreased, FMR increased for all algorithms, but the proposed method maintained the lowest FRR. TMR vs. EER: the proposed method had the highest TMR, followed by LPC-CNN and MFCC-CNN; while it achieved the lowest EER at lower FMR levels, performance declined as FMR exceeded 40%. TMR vs. FMR: the proposed method consistently showed the highest TMR across all FMR values, indicating a superior ability to correctly identify speakers. Overall, the MFCC-RCNN-HCO algorithm demonstrated high accuracy under typical conditions but showed room for improvement in maintaining consistency at higher error rates and in noisy environments.

24. CONCLUSION The proposed method excels in maintaining the lowest EER, FMR, and FRR while achieving the highest TMR. It consistently demonstrates the highest TMR even as FMR increases, showcasing a robust ability to accurately identify speakers from audio signals. A low FRR ensures reliable Hansard transcriptions, with the trade-off of increasing FMR being a common aspect of such systems. The method achieves high accuracy with a low EER and remains robust under typical parliamentary conditions with high noise levels.

25. RECOMMENDATIONS Research: future research can focus on mixing MFCC and LPC coefficients in defined ratios to improve the proposed method. Practice: the method can be used in the Senate and National Assembly by incorporating automatic transcription, so that recognized speakers' speeches are transcribed to text. Policy: policies that support adoption and use of the system in the National and County assemblies can be formulated to help in Hansard transcription.

26. REFERENCES
Ashar, A., Bhatti, M. S., & Mushtaq, U. (2020). Speaker identification using a hybrid CNN-MFCC approach. 2020 International Conference on Emerging Trends in Smart Technologies (ICETST), 1–4.
Ayvaz, U., Gürüler, H., Khan, F., Ahmed, N., Whangbo, T., & Bobomirzaevich, A. (2022). Automatic speaker recognition using mel-frequency cepstral coefficients through machine learning. CMC-Computers Materials & Continua, 71(3).
Costantini, G., Cesarini, V., & Brenna, E. (2023). High-level CNN and machine learning methods for speaker recognition. Sensors, 23(7), 3461. https://www.mdpi.com/1424-8220/23/7/3461
Moondra, A., & Chahal, P. (2023). Improved speaker recognition for degraded human voice using modified-MFCC and LPC with CNN. International Journal of Advanced Computer Science and Applications, 14(4).
Gaurav, Bhardwaj, S., & Agarwal, R. (2023). An efficient speaker identification framework based on Mask R-CNN classifier parameter optimized using hosted cuckoo optimization (HCO). Journal of Ambient Intelligence and Humanized Computing, 14(10), 13613–13625. https://doi.org/10.1007/s12652-022-03828-7

THE END