Deep Learning Based Speaker Recognition System with CNN and LSTM Techniques

Noshin Nirvana Prachi
Electrical and Computer Engineering
North South University, Bangladesh
[email protected]
Faisal Mahmud Nahiyan
Electrical and Computer Engineering
North South University, Bangladesh
[email protected]

Md. Habibullah
Electrical and Computer Engineering
North South University, Bangladesh
[email protected]

Riasat Khan
Electrical and Computer Engineering
North South University, Bangladesh
[email protected]

Abstract—Speaker recognition is an advanced method to
identify a person from the biometric characteristics of speaking
voice samples. Speaker recognition has become a vastly popular
and useful research subject with countless essential applications in
security, assistance, replication, authentication, automation, and
verification. Many techniques are implemented using deep
learning and neural network concepts and various datasets for
speaker verification and identification. The primary goal of this
work is to create improved robust techniques of speaker
recognition to identify audio and enhance accuracy to human
levels of comprehension. TIMIT and LibriSpeech datasets are
used in this paper to develop an efficient automatic speaker
recognition system. This work focuses on using MFCC to
transform audio to spectrograms without losing the essential
features of the audio file in question. We have used a closed set
and an open set implementation procedure on these datasets. The
closed set implementation uses a standard machine learning
convention of utilizing the same datasets for training and testing,
leading to higher accuracy. On the other hand, the open set
implementation uses one dataset to train and another to test on
each occasion. The accuracy, in this case, turned out to be
relatively lower. On each dataset, CNN and LSTM deep learning
techniques have been used to identify the sound, leading to the
observation that implementing CNN resulted in higher accuracy.
Keywords—convolutional neural network, deep learning, long
short-term memory, mel-frequency cepstral coefficient, speaker
recognition, speaker identification.
I. INTRODUCTION
Sound has been a key indicator of identification from the
very start of humankind. Humans learned to differentiate
between different sounds to distinguish danger, weather,
enemy, and surrounding environment [1]. And thus, humans
have grown an inbuilt capability to identify the source of the
sound based on the characteristics of the sounds [2]. With this
inbuilt ability given by evolution, we can recognize various
conditions from voices alone. Classification of sounds to
recognize speakers has been a significant field of research in
recent times. Speaker recognition can be used in
bioinformatics, scam identification, copyright verification, and
search and rescue [3]. To classify the speaker based on the
voice, the audio signal has to be preprocessed, the feature has
to be extracted, and finally, the speaker will be verified and
classified. The audio samples are preprocessed into various
segments to extract essential features. Mel-frequency Cepstral
Coefficient (MFCC), spectral flux, chroma vector, poly
features, and other features for audio classification are among
the most sought-after features. We compared two neural network
models, the long short-term memory (LSTM) [4] and the
convolutional neural network (CNN) [5] based on the MFCC
feature, to determine which one performed better in speaker
recognition. The goal of the MFCC feature extraction technique is
to transform time-domain signals into frequency-dependent
representations and to apply Mel filters that simulate the human
cochlea, which uses more filters at low frequencies for deep
sounds and fewer filters at higher pitches [6]. Consequently,
MFCC features are centered on how the human auditory system
perceives audible frequencies, which lets the resulting feature
vectors handle the dynamic nature of real-life sounds in
classification. The
datasets we are using in this work for speaker recognition are
TIMIT and LibriSpeech datasets. This work compares the classification
done by CNN and LSTM using similar characteristics and
identical datasets to determine which models provide a more
accurate result in recognizing the speaker from the given
dataset. The LibriSpeech dataset is highly optimized and
requires less preprocessing than the TIMIT dataset; the
methodology and techniques used are described in the latter
part of the paper.
People have been trying to use more and more efficient
techniques to recognize speakers from the sounds, which has
been proven quite accurate. For instance, an automatic
architecture constructed on a system of deep neural networks
was presented for distant voice recognition in [7]. The
experiments were carried out with a variety of assignments,
open-source datasets, and acoustic settings. The proposed deep
neural network effectively and correctly exploits total
communication between a speech enhancement and a speech
recognizer, resulting in marginally better performance than that
obtained with more traditional distant speech recognition
(DSR) pipelines. DSR is a critical technique for creating more
flexible and effective human-machine interfaces. Next, a deep
neural network approach has been utilized to address the
problem of DSR. Two open-source datasets (TIMIT and WSJ)
are used to evaluate the performance of the proposed
recognition technique.
In [8], the authors implemented SincNet, which is a neural
architecture to process audio signals directly. SincNet has been
widely utilized on performing speaker identification and
classifications. In this work, standard CNN architecture has
been employed with a set of time-domain convolution between
input waveform and FIR filters. All the elements of each filter
layer on this algorithm are learned from the input data. SincNet
uses predefined functions that depend on a few learnable
parameters. The predefined function is chosen so that rectangular
bandpass filters are employed, since the magnitude of a
conventional bandpass filter can be expressed as the difference
between two low-pass filters. After passing the raw waveform
through the SincNet filters, pooling, layer normalization, leaky
ReLU, dropout, CNN/DNN layers, and softmax, the speaker is
identified. SincNet reduces the number of parameters in the first
convolutional layer, which offers the possibility to derive very
selective filters without adding parameters to the optimization problem.
In the initial convolutional layers, SincNet feature maps are
more interpretable than other methods. SincNet and standard
CNN were compared on TIMIT and LibriSpeech datasets.
SincNet does better than the other systems on both datasets. The
gap over a traditional CNN fed with the raw waveform is larger on
TIMIT, showing that SincNet outperforms the CNN. The gap is
smaller on the LibriSpeech database, where roughly a 4%
improvement was achieved along with faster convergence. Standard
FBANKs provide comparable results on TIMIT, whereas on
LibriSpeech the results diverge more clearly; with little training
data the network cannot learn filters better than FBANKs.
In [9], G. Trigeorgis et al. proposed a recurrent
convolutional model for speech emotion recognition. The
implemented network has been applied to the audio voice
signal to perform a complete spontaneous emotion estimation
from speech data. To capture the emotional content of different
styles of speaking, the acoustic features are required to be
sufficiently robust. Conversely, the applied machine learning
techniques have to be insensitive to outliers while still modeling
the context. The proposed approach achieves considerably
better efficiency than the standard-designed features on the
open-source RECOLA dataset. It is a challenging job to
recognize spontaneous emotions automatically from speech.
The speech emotion classification has been managed by using
long short-term memory (LSTM) systems to perform the
comparisons. Next, convolutional neural networks (CNNs)
have been combined with LSTM to automatically learn the best
representation of the speech signal directly from the raw
samples. In this end-to-end speech emotion recognition work,
this paper showed that the proposed algorithm gets a better
result than the conventional methods based on signal
processing approaches.
In [10], S. Novoselov et al. focus on a deep speaker
embedding model whose goal is to extract embeddings that are
essential for a speaker recognition task, with the primary
focus on text-independent speaker recognition.
The system explains how an angular softmax activation at the
final classification layer of a neural network is preferable to
simple softmax activation. It allows the training of a more
generalized discriminative speaker embedding extractor. A
comparison in the estimation of the effectiveness of metrics is
also carried out, which led to a finding that cosine similarity is
more efficient than PLDA. Besides, it has been shown that deep
networks with residual frame-level connections tend to perform
at a higher level than more commonly used relatively shallow
architectures. Satisfactory performance can be achieved with
discriminatively trained metric learning approaches, as opposed
to the standard LDA-PLDA method for the embedding
backend. In that work, DNN posterior-based i-vectors are
extracted after a DNN model is trained on the switchboard
corpus using the Kaldi speech recognition network to carry out
the proposed implementation. Outputs of the DNN
corresponded to the 2,700 speech triphone states used for
statistical calculation. In addition, an x-vector-based system
configuration has been used in that paper. The system accepts
23-dimensional MFCCs as input in lieu of raw filter-banks.
Next, the non-speech frames are filtered out by an energy-based
voice activity detector (VAD), and the speaker embeddings are
extracted from the second to final layer of the classifier system.
This configuration is referred to as X-vectorNet. Another model
that has been implemented in this work is known as
SpeakerMaxPoolNet. It involves the development of a max-
pooling TDNN-based speaker embedding extractor where the
model is trained on short sections of utterance signals (3-10
seconds), which are arbitrarily sampled from the training data.
The backend to the Maxpooling TDNN embeddings is applied
using a simple cosine metric and an LDA-PLDA approach.
Another system known as SpeakerResNet has been used, and it
takes the same MFCC input features and uses the same
approach of data extraction and backend implementation as
SpeakerMaxPooling does. For training the proposed model,
NIST datasets from 1998-2008 have been utilized without any
data augmentation techniques. Telephone speech was collected
in the experimental setup for training.
We proposed an automatic speaker recognition system using
deep learning approaches in this work. We have implemented
CNN and LSTM architectures on two different datasets, TIMIT
and LibriSpeech. The audio data were translated from the time
domain to the frequency domain using the Fourier transform.
The LSTM is a sequential model with a high precision rate in
the speaker classification process. The CNN model was built with two
Conv2d layers, a batch normalization layer, and three dense
layers with the final layer as output. Our work consisted of the
implementation of both models in an open set and closed set
implementations and a comparison of the achieved results to
visualize the accuracy given by the models.
The rest of the paper has been presented as follows: Section
II covers a description of the applied methodologies. The model
architecture is briefly explained in Section III. Section IV
summarizes the experimental results of speaker and speech
recognition. Finally, Section V highlights our conclusions and
future research.
II. METHODOLOGY
A. Dataset
The dataset or corpus used for training the models plays a
vital role since performance measures cannot be appropriately
compared if the testing circumstances vary too much. Several
datasets for speaker recognition tasks have been used in the
literature, such as VoxCeleb, CSS10, Ted-Lium, CHiME,
VCTK, WSJ, etc. A limited number of speakers could provide
vast speech data in some cases, while small and relevant speech
data could be obtained from a large number of speakers. In this
study, we worked with two popular SR-related datasets TIMIT
and LibriSpeech.
1) TIMIT: The TIMIT corpus consists of 630
speakers/classes, with ten samples collected from each class
[11]. The speakers belong to 8 major dialect regions of the
USA. The Train set has 462 speakers, while the test set contains
the samples from the remaining 168 speakers. When we first
analyzed a closed set approach to the problem, we only
considered the train set, with an 8:2 split for training and testing
on both datasets we used.
However, the speakers appearing in the test set are entirely
different from the train set, making it an open set problem. The
proposed open set approach maintained the train-test
arrangement found in the original setup, with 4,620 audio files
in the train folder and 1,680 audio files in the test folder.
2) LibriSpeech: The LibriSpeech ASR corpus
(train-clean-100) contains 28,539 audio samples from 251
speakers, which we have used for the closed set
implementation. However, the open set implementation
worked with a different setup of the dataset that followed the
same distribution as [12]. The training and testing materials are
chosen at random; for training and testing, 12–15 s and 2–6 s
of materials were used, respectively. The total number of
samples was 21,933, with 14,481 serving as training data and
the remainder as testing data after removing the beginning and
end silences.
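For illustration, a minimal Python sketch of the 8:2 closed set split described above is given below; the lists of file paths and speaker labels are assumed to have been collected beforehand, and scikit-learn is used here only as one convenient way to perform the split.

from sklearn.model_selection import train_test_split

def closed_set_split(file_paths, speaker_labels):
    # 8:2 train/test split of a single corpus; stratifying on the speaker
    # label keeps every speaker represented in both partitions, which is
    # what a closed set implementation requires.
    return train_test_split(file_paths, speaker_labels,
                            test_size=0.2,
                            stratify=speaker_labels,
                            random_state=0)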
B. Audio Exploratory
The samples are in .wav format, which is used for digitizing
continuous sound waves. Each file is sampled and converted
into a one-dimensional NumPy array of digital data at discrete
intervals. The resulting waveforms are one-dimensional and can
take any amplitude or frequency at a given time. Using the sample
rate, the sample values can be put back together to reconstruct
the speech if required [13]. The Nyquist-Shannon theorem [13]
implies that a signal's sampling frequency should be at least
twice its highest frequency component to avoid distorting the
signal's underlying information. In our scenario,
we used the libROSA Python package, which normalizes the
data such that the array only contains values ranging from −1
to 1. The default sample rate for this program is 22,050 Hz,
which reduces the array size and training time. When the mono
argument is set to true, the libROSA load function generates a
mono audio signal by merging the two stereophonic channels into
a one-dimensional NumPy array. The default frame and hop lengths
are 2,048 and 512 samples, respectively [14]. Various kinds
of characteristics are retrieved from the audio once it is
represented as an array.
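A minimal sketch of this loading step with libROSA (the file path is a placeholder) could look as follows:

import librosa

def load_audio(path):
    # librosa resamples to its default 22,050 Hz rate, merges the stereo
    # channels when mono=True, and returns an amplitude-normalized
    # one-dimensional NumPy array together with the sample rate.
    signal, sample_rate = librosa.load(path, sr=22050, mono=True)
    return signal, sample_rate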
C. Feature Extracted (MFCC)
The audio must be translated from a time domain to a
frequency domain using Fourier transformation to extract
spectral information for the classification procedure. Several
characteristics can be used in this regard, such as Mel
Frequency Cepstral Coefficients (MFCC), Mel Spectrogram,
Spectral Centroid, Spectral Rolloff. MFCC technique has been
chosen in this work because it outperforms other methods such
as Chroma CENS, STFT, Chroma CQT, Spectral Contrast,
Tonnetz, and others to deliver precise information and effective
representations and assure higher accuracy in audio-based
classification tasks. Moreover, it is a principal feature
extraction technique in many research works that deal with
audio signals [15]. MFCCs closely mimic how humans
perceive audio signals. The libROSA library came in handy in
the feature extraction task since it has numerous built-in
functions [15] to generate the required spectrogram. The
librosa.feature.mfcc() function requires passing a few
parameters such as the imported audio and the sampling
frequency of the audio (default sampling rate of 22,050 Hz) to
generate MFCCs. The method returns 40 MFCCs across 173
frames by default unless the number of MFCCs to return is
passed to the function. The mean of the 173 frames was
used to transform the two-dimensional array into a one-
dimensional array containing the frames’ mean values. Fig. 1
illustrates the coefficients of the employed MFCC technique
with respect to time.

Fig. 1. Coefficients of the MFCC algorithm with respect to time.
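A minimal sketch of this feature extraction step, assuming the audio has already been loaded as described above, is shown below; averaging over the time frames yields the one-dimensional feature vector.

import numpy as np
import librosa

def extract_mfcc(signal, sample_rate=22050, n_mfcc=40):
    # librosa.feature.mfcc returns an (n_mfcc, n_frames) array; taking the
    # mean across the frame axis gives one mean value per coefficient.
    mfcc = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=n_mfcc)
    return np.mean(mfcc, axis=1)   # shape: (n_mfcc,)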
III. MODEL ARCHITECTURE
In this work, two neural network models, long short-term
memory (LSTM) and the convolutional neural network (CNN),
have been used to design an automatic speaker recognition
system. The system architecture of the CNN and LSTM are
discussed briefly in the following paragraphs.
A. CNN Model
The overall architecture of a CNN is made up of many layers
merged together. To train a CNN, several types of decisions must
be made regarding architectural patterns and hyperparameters [17].
These include determining the input data format, the number of
convolutional and pooling layers, and the filter dimensions, as
well as defining the learning rate, the number of epochs, dropout
probability [15], and batch size [16]. In this research, we built
a sequential CNN model with two 2D convolutional layers
(Conv2D), a batch normalization layer, and three dense layers,
with the output layer as the last layer, which is demonstrated in
Fig. 2.
1) Convolutional Layers: The first Conv2D layer
consists of 64 filters, receiving an input shape of 20×5×1. The
kernel size is set to 5, giving a 5×5 filter matrix, and the
stride is set to 1. Consequently, the filter moves one
unit at a time across the input volume. When the padding for
the operation is set to ‘same,’ the output has the same height
and width as the input. This layer’s activation function is
selected ReLU, which is more computationally efficient than
sigmoid units [2]. The second layer Conv2D also uses the same
activation function and has the same configuration for the
kernel size, strides, and padding as the first one. However, it
consists of 128 filters, and the input shape is 20×5×64. It also
has a dropout of 30% to minimize overfitting. A
MaxPooling2D layer is used after each of the two Conv2D
layers. The MaxPooling2D layer reduces the dimension of the
input shape.
2) Batch Normalization: Since the previous layers'
parameters change during the deep neural network training
process, lower learning rates and careful parameter
initialization are required, and models with saturating
nonlinearities become harder to train, which increases the
training duration. Batch normalization accelerates the
training process by reducing this issue, also known as the
internal covariate shift [3]. It allows the proposed model to be
less cautious about initialization and allows higher learning
rates.
3) Flatten Layer: The flatten layer turns the
convolutional layers' output into a one-dimensional array,
which is subsequently fed into the following hidden layers using
global average pooling.
4) Dense Layers: One output layer and two dense layers
are used to complete the model’s processing. There are 256 and
512 nodes in the two dense layers, respectively. ReLU is the
nonlinear function utilized in the first two dense layers because
it decreases gradient descent time by converting all negative
activations to zero.

Fig. 2. Proposed CNN architecture.
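As a rough Keras sketch of the architecture in Fig. 2: the filter counts, kernel size, stride, padding, dropout, and dense widths follow the description above, while the pooling size and the exact placement of batch normalization are assumptions; the softmax output layer described in the next subsection is also included.

from tensorflow.keras import layers, models

def build_cnn(num_classes, input_shape=(20, 5, 1)):
    # Sequential CNN: two Conv2D blocks with max pooling, batch
    # normalization, a flatten layer, and three dense layers ending
    # in a softmax output over the speaker classes.
    return models.Sequential([
        layers.Conv2D(64, kernel_size=5, strides=1, padding='same',
                      activation='relu', input_shape=input_shape),
        layers.MaxPooling2D(pool_size=(2, 2)),   # pooling size assumed
        layers.Conv2D(128, kernel_size=5, strides=1, padding='same',
                      activation='relu'),
        layers.Dropout(0.30),                    # 30% dropout against overfitting
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.BatchNormalization(),
        layers.Flatten(),
        layers.Dense(256, activation='relu'),
        layers.Dense(512, activation='relu'),
        layers.Dense(num_classes, activation='softmax'),
    ])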
5) Output Layer: In our closed set implementation, the
output layer had a softmax activation function with as many
nodes as the number of classes in the training split of each
dataset. Softmax turns logits into probabilities and predicts the
choice with the highest probability. We built an intermediate
model for the open set implementation by popping out the last
softmax layer after the initial model has learned the features
and optimized weights after several epochs. Then, the MFCC
spectral characteristics are extracted from the appropriate test
set and fed to this intermediate model.
The model then outputs an embedding vector for a test sample
from each speaker, which is matched against the other test
samples of the same speaker using cosine similarity, Euclidean
distance, and Manhattan distance. Finally, the labels are
compared to determine the model’s accuracy.
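A rough sketch of this open set matching step is given below; the intermediate embedding model and the MFCC feature tensors are assumed to be available, and SciPy's distance functions are used here only as one way to compute the three measures.

import numpy as np
from scipy.spatial.distance import cosine, euclidean, cityblock

def embed(intermediate_model, mfcc_features):
    # Map one MFCC tensor (e.g., shape (20, 5, 1)) to its embedding vector
    # using the truncated model with the softmax layer removed.
    return intermediate_model.predict(mfcc_features[np.newaxis, ...])[0]

def compare(embedding_a, embedding_b):
    # The three measures used to match a test embedding against the other
    # test samples of the same speaker.
    return {
        'cosine_similarity': 1.0 - cosine(embedding_a, embedding_b),
        'euclidean_distance': euclidean(embedding_a, embedding_b),
        'manhattan_distance': cityblock(embedding_a, embedding_b),
    }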
B. LSTM Model
This paper uses the deep learning-based LSTM model for
speaker recognition because of its recent audio-based
classification popularity and high precision rate in the
classification process. This technique is also a sequential model
with two time-distributed layers, two LSTM layers, a flatten
layer, and another dense layer at the end. The proposed LSTM
architecture is depicted in Fig. 3.
1) LSTM Layers: There are 128 hidden units in each of
the LSTM layers. The first layer accepts an input of form 20×5,
where 20 indicates the number of time steps and 5 the number of
features per step. To avoid overfitting, the second LSTM layer
had a dropout factor of 0.30.
2) Batch Normalization: This model also uses batch
normalization after the first two layers. The training time was
effectively reduced by using batch normalization.
3) Time Distributed Layers: The first time distributed
dense layer takes an input size of 128 and outputs 256 nodes,
which serve as the input shape for the second time distributed
layer. This second layer’s output comprises 512 nodes. In both
layers, ReLU was used as the activation function.
4) Flatten Layer: This layer flattens the 3D output from
the second time distributed layer and transfers the lengthy
vector of input data to the dense layer, which is the last layer.
5) Dense Layer: For the closed set implementation, the
dense output layer, like the CNN model, employed a softmax
activation function. The inputs were transformed into a discrete
probability distribution. The open set implementation again
used an intermediate model by popping out the last softmax
layer after a few training epochs. Following the feature
extraction from the test sets, the intermediate model produced
an embedding vector for each test class. Consequently, the
generated vector is matched with other instances of that
particular test class using cosine similarity, Manhattan distance,
and Euclidean distance.

Fig. 3. Proposed LSTM architecture.
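As a rough Keras sketch of the LSTM architecture in Fig. 3: the unit counts, dropout factor, time-distributed dense widths, and softmax output follow the description above, while the return_sequences settings and the exact placement of dropout are assumptions needed to make the layer ordering valid.

from tensorflow.keras import layers, models

def build_lstm(num_classes, input_shape=(20, 5)):
    # Sequential LSTM: two 128-unit LSTM layers, batch normalization,
    # two time-distributed dense layers (256 and 512 nodes), a flatten
    # layer, and a softmax output over the speaker classes.
    return models.Sequential([
        layers.LSTM(128, return_sequences=True, input_shape=input_shape),
        layers.LSTM(128, return_sequences=True, dropout=0.30),  # 30% dropout
        layers.BatchNormalization(),
        layers.TimeDistributed(layers.Dense(256, activation='relu')),
        layers.TimeDistributed(layers.Dense(512, activation='relu')),
        layers.Flatten(),
        layers.Dense(num_classes, activation='softmax'),
    ])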
IV. RESULTS
In this section, the simulation results for the proposed deep
learning-based speaker recognition system have been
discussed. Some important parameters of the implemented
classification model, viz. loss function and optimizer, and
performance metrics to evaluate the model have been discussed
in the following paragraphs.
1) Loss Function: In this work, we used categorical
cross-entropy loss as the loss function. This loss function is
intended for multi-class classification tasks, and it performs
better than MSE and RMSE since it penalizes incorrect predictions
more heavily. The cross-entropy loss function is expressed as:

\mathrm{Loss} = -\sum_{i=1}^{M} y_i \log(\hat{y}_i)   (1)

where M denotes the output size, y_i is the target value for
class i, and \hat{y}_i is the predicted probability for class i.
2) Metric: The metric that we have used to measure the
performance of the proposed speaker recognition system is
accuracy. Accuracy denotes the ratio of the number of correct
predictions to the total number of predictions our model
produces, which has been depicted in (2). When multiplied by
100, this value gives us the percentage of the model’s success
in recognizing which speaker a given audio sample belongs to.
\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} \times 100\%   (2)
3) Optimizer: Optimizers reduce the losses by fine-
tuning the attributes of a neural network, e.g., weights and
learning rates. We used the Adam optimizer, which uses the
adaptive learning rate for each parameter. It uses a combination
of the adaptive gradient algorithm (AdaGrad) and root mean
square propagation (RMSProp) to handle sparse gradients in
noisy problems.
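Putting the loss, metric, and optimizer together, a minimal training sketch could look as follows; one-hot encoded speaker labels and pre-extracted feature arrays are assumed, and the epoch and batch settings reported below are passed in rather than hard-coded.

from tensorflow.keras.optimizers import Adam

def compile_and_train(model, x_train, y_train, x_test, y_test,
                      epochs=100, batch_size=32):
    # Categorical cross-entropy expects one-hot encoded speaker labels;
    # accuracy is reported on both the training and the held-out split.
    model.compile(loss='categorical_crossentropy',
                  optimizer=Adam(),
                  metrics=['accuracy'])
    return model.fit(x_train, y_train,
                     validation_data=(x_test, y_test),
                     epochs=epochs, batch_size=batch_size)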
For image-based classifications, CNN is a highly effective
approach. Because of its architecture, it can use spatial
correlation in data such as images and voice signals. LSTM, on the other
hand, performs better when dealing with time-series or
sequential data. In most situations, the LSTM model performs
better than the CNN model for audio classification. Testing
accuracy can increase by up to 4% after data augmentation.
However, the same did not happen in our case since we did not
opt for data augmentation.
On the TIMIT dataset, we started with 500 epochs to train
our CNN and LSTM models on closed sets. We ran the CNN
model for 300 epochs and the LSTM model for 100 epochs in
the open set implementation on the same dataset. On the
LibriSpeech dataset, we chose to use epochs for the closed set
implementation. The training and testing accuracies for the
CNN technique (closed set implementation) on the LibriSpeech
and TIMIT datasets, with the change of the number of epochs,
are shown in Fig. 4 and Fig. 5, respectively.
It can be seen that as the number of epochs rises, the model's
accuracy improves for training and testing data. We divided the
dataset into 50 batches for both models and datasets.

Fig. 4. Training and test accuracies of the LibriSpeech dataset for the proposed CNN model.

Fig. 5. Training and test accuracies of the TIMIT dataset for the proposed CNN model.
As shown in Table I, the CNN model with the TIMIT
dataset achieved 77.51% and 80.63% test accuracy for open
set and closed set implementations, respectively. The training
accuracy was 100% and 99.84% for the open and closed set
implementations, respectively, using the same model and
TIMIT dataset. However, the LSTM model performed poorly,
producing 62.13% and 71.54% on open set and closed set
implementations, respectively for this dataset.
TABLE I. TESTING ACCURACY OF PROPOSED MODELS
Dataset (Implementation)                             CNN      LSTM
TIMIT (Open Set implementation)                      77.51%   62.13%
TIMIT (Closed Set implementation)                    80.63%   71.54%
LibriSpeech (Closed Set implementation)              97.85%   -
Customized LibriSpeech (Open Set implementation)     64.70%   -
Customized LibriSpeech (Closed Set implementation)   84.96%   -


We achieved 99.26% accuracy for training and 97.85%
accuracy for testing, using the CNN model for the closed set
dataset implementation on LibriSpeech dataset. The same
model on the customized LibriSpeech dataset produced 64.70%
accuracy for the open set implementation, 99.97% accuracy for
training, and 84.96% accuracy for testing the closed set
implementation.
V. CONCLUSION
In this paper, an automatic speaker recognition system has
been implemented utilizing deep learning approaches. CNN
and LSTM techniques have been employed on the open set and
closed set of the TIMIT and LibriSpeech datasets. We have
implemented a completely different model for speaker
recognition tasks that requires a relatively smaller number
of parameters, making the training process shorter while
using the same architecture for two diverse datasets. We are
planning to test SincNet on other prominent speech
identification challenges, such as VoxCeleb database, in the
future. Inspired by the encouraging results of this research, we
will look at using SincNet for both supervised and unsupervised
speaker adaptation. Furthermore, while our study is limited to
speaker and voice identification, we think the suggested
technology describes the fundamental paradigm of time series
processing that may be extended to various applications.
REFERENCES
[1] I. B. Fernández and M. Leszczuk, “Monitoring of audio visual quality by
key indicators,” Multimedia Tools and Applications, vol. 77, pp. 2823–
2848, 2018.
[2] A. P. Mishra, N. S. Harper and J. W. H. Schnupp, “Exploring the
distribution of statistical feature parameters for natural sound textures,”
PLOS ONE, vol. 16, pp. 1–21, 2021.
[3] A. Q. Ohi, M. F. Mridha, M. A. Hamid and M. M. Monowar, “Deep
speaker recognition: Process, progress, and challenges,” IEEE Access,
vol. 9, pp. 89619–89643, 2021.
[4] B. N. Saha and A. Senapati, “Long short term memory (LSTM) based
deep learning for sentiment analysis of English and Spanish Data,”
International Conference on Computational Performance Evaluation
(ComPE), pp. 442–446, 2020.
[5] M. N. I. Suvon, R. Khan and M. Ferdous, “Real time Bangla number plate
recognition using computer vision and convolutional neural
network,” International Conference on Artificial Intelligence in
Engineering and Technology (IICAIET), pp. 1–6, 2020.
[6] Z. Wanli and L. Guoxin, “The research of feature extraction based on
MFCC for speaker recognition,” International Conference on Computer
Science and Network Technology, pp. 1074–1077, 2013.
[7] M. Ravanelli, P. Brakel, M. Omologo and Y. Bengio, “A network of deep
neural networks for distant speech recognition,” International Conference
on Acoustics, Speech and Signal Processing, pp. 1–5, 2017.
[8] M. Ravanelli and Y. Bengio, “Speaker recognition from raw waveform
with sincnet,” IEEE Spoken Language Technology Workshop, pp. 1021–
1028, 2018.
[9] G. Trigeorgis et al., “Adieu features? End-to-end speech emotion
recognition using a deep convolutional recurrent network,” International
Conference on Acoustics, Speech and Signal Processing, pp. 5200–5204,
2016.
[10] S. Novoselov, A. Shulipa, I. Kremnev, A. Kozlov and V. Schemelinin,
“On deep speaker embeddings for text- independent speaker recognition,”
The Speaker and Language Recognition Workshop, pp. 378–385, 2018.
[11] A. H. Abdelaziz, “NTCD-TIMIT: A new database and baseline for noise-
robust audio-visual speech recognition,” Proc. Interspeech, pp. 3752–
3756, 2017.
[12] V. Panayotov, G. Chen, D. Povey and S. Khudanpur, “Librispeech: An
ASR corpus based on public domain audio books,” International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.
5206–5210, 2015.
[13] Z. Song, B. Liu, Y. Pang, C. Hou and X. Li, “An improved Nyquist–
Shannon irregular sampling theorem from local averages,” IEEE
Transactions on Information Theory, vol. 58, no. 9, pp. 6093–6100, 2012.
[14] B. McFee et al., “librosa: Audio and music signal analysis in Python,”
Proceedings of the 14th Python in science conference, vol. 8, pp. 18–25,
2015.
[15] S. Davis and P. Mermelstein, “Comparison of parametric representations
for monosyllabic word recognition in continuously spoken sentences,”
IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28,
pp. 357–366, 1980.
[16] K. J. Piczak, “Environmental sound classification with convolutional
neural networks,” International Workshop on Machine Learning for
Signal Processing (MLSP), pp. 1–6, 2015.
[17] M. T. Islam, S. T. Mashfu, A. Faisal, S. C. Siam, I. T. Naheen and R.
Khan, “Deep learning-based glaucoma detection with cropped optic cup
and disc and blood vessel segmentation,” IEEE Access, vol. 10, pp. 2828–
2841, 2022.