Deep Learning Based Speaker Recognition System with CNN and LSTM Techniques
Noshin Nirvana Prachi
Electrical and Computer Engineering
North South University, Bangladesh
[email protected]
Faisal Mahmud Nahiyan
Electrical and Computer Engineering
North South University, Bangladesh
[email protected]
Md. Habibullah
Electrical and Computer Engineering
North South University, Bangladesh
[email protected]
Riasat Khan
Electrical and Computer Engineering
North South University, Bangladesh
[email protected]
Abstract—Speaker recognition is an advanced method of identifying a person from the biometric characteristics of spoken voice samples. It has become a highly popular and useful research subject, with numerous essential applications in security, assistance, replication, authentication, automation, and verification. Many techniques for speaker verification and identification have been implemented using deep learning and neural network concepts on various datasets. The primary goal of this work is to create robust speaker recognition techniques that identify speakers from audio with accuracy approaching human levels of comprehension. The TIMIT and LibriSpeech datasets are used in this paper to develop an efficient automatic speaker recognition system. This work focuses on using mel-frequency cepstral coefficients (MFCC) to transform audio into spectrograms without losing the essential features of the audio file in question. We apply both a closed-set and an open-set implementation procedure to these datasets. The closed-set implementation follows the standard machine learning convention of drawing training and test samples from the same dataset, leading to higher accuracy. In contrast, the open-set implementation trains on one dataset and tests on the other, and the accuracy in this case turned out to be comparatively lower. On each dataset, CNN and LSTM deep learning techniques have been used to identify the speaker, leading to the observation that the CNN achieved higher accuracy.
Keywords—convolutional neural network, deep learning, long
short-term memory, mel-frequency cepstral coefficient, speaker
recognition, speaker identification.
I. INTRODUCTION
Sound has been a key means of identification since the dawn of humankind. Humans learned to differentiate between sounds to recognize danger, weather, enemies, and the surrounding environment [1]. Thus, humans have developed an innate ability to identify the source of a sound based on its characteristics [2]. This evolved ability allows us to infer a variety of conditions from voices alone. Classification of sounds to recognize speakers has become a significant field of research in recent times. Speaker recognition can be used in bioinformatics, scam identification, copyright verification, and search and rescue [3]. To classify a speaker based on voice, the audio signal has to be preprocessed, features have to be extracted, and finally, the speaker is verified and classified. The audio samples are segmented during preprocessing to extract essential features. Mel-frequency cepstral coefficients (MFCC), spectral flux, chroma vectors, and poly features are among the most widely used features for audio classification. We compared two neural network models, the long short-term memory (LSTM) network [4] and the convolutional neural network (CNN) [5], using the MFCC feature, to determine which one performs better in speaker recognition. The goal of the MFCC feature extraction technique is to transform time-domain signals into frequency-domain representations and apply Mel filters to simulate the human cochlea, which uses more filters at low frequencies for deep sounds and fewer filters for higher-pitched sounds [6]. Consequently, MFCC features are centered on the human auditory system's response to audible frequencies, which allows the resulting feature vectors to handle the dynamic nature of real-life sounds during classification.
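As an illustration of this extraction step, the following is a minimal sketch of MFCC computation, assuming the librosa library; the file name, sampling rate, and number of coefficients are illustrative assumptions rather than the exact configuration used in this work.

# Minimal sketch of the MFCC extraction step described above, assuming the
# librosa library; file name, sampling rate, and coefficient count are
# illustrative assumptions, not the exact configuration used in this work.
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    # Load the waveform at a fixed sampling rate (TIMIT and LibriSpeech audio
    # is commonly handled at 16 kHz).
    signal, sr = librosa.load(wav_path, sr=sr)
    # Short-time analysis with a Mel filter bank followed by a DCT yields the MFCCs.
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    # Transpose to (frames, coefficients) so each row is one feature vector.
    return mfcc.T

# Hypothetical usage:
# features = extract_mfcc("speaker_0001_utterance_01.wav")
# print(features.shape)  # (num_frames, 13)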
The datasets used in this work for speaker recognition are TIMIT and LibriSpeech. This work compares the classification performed by the CNN and the LSTM using the same features and identical datasets to determine which model recognizes the speaker from a given dataset more accurately. The LibriSpeech dataset is highly optimized and requires less preprocessing than the TIMIT dataset; the methodology and techniques used are described in the latter part of the paper.
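To make the comparison concrete, the following is a minimal sketch of the two model families evaluated here, written with the Keras API; the layer sizes, input dimensions, and number of speakers are illustrative assumptions and do not reflect the exact architectures used in our experiments.

# Minimal Keras sketch of the two model families compared in this work
# (CNN vs. LSTM on MFCC inputs); all dimensions below are hypothetical.
from tensorflow.keras import layers, models

NUM_FRAMES, NUM_MFCC, NUM_SPEAKERS = 300, 13, 100  # assumed, for illustration

def build_cnn():
    # Treat the MFCC matrix as a single-channel "image" (frames x coefficients).
    return models.Sequential([
        layers.Input(shape=(NUM_FRAMES, NUM_MFCC, 1)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(NUM_SPEAKERS, activation="softmax"),
    ])

def build_lstm():
    # Treat the MFCC matrix as a sequence of frame-level feature vectors.
    return models.Sequential([
        layers.Input(shape=(NUM_FRAMES, NUM_MFCC)),
        layers.LSTM(128),
        layers.Dense(NUM_SPEAKERS, activation="softmax"),
    ])

for model in (build_cnn(), build_lstm()):
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

Either model can then be trained with model.fit on batches of MFCC matrices paired with integer speaker labels.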
Researchers have applied increasingly efficient techniques to recognize speakers from sound, and these techniques have proven quite accurate. For instance, an automatic architecture built on deep neural networks was presented for distant speech recognition in [7]. The experiments were carried out on a variety of tasks, open-source datasets, and acoustic settings. The proposed deep neural network effectively exploits full communication between a speech enhancement module and a speech recognizer, resulting in marginally better performance than more traditional distant speech recognition (DSR) pipelines. DSR is a critical technique for creating more flexible and effective human-machine interfaces. Next, a deep