PhD Presentation on Facial Expression Recognition - Final.pptx


About This Presentation

PhD presentation on facial expression recognition


Slide Content

Intelligent Facial Expression Recognition. The Superior University, Lahore. Presented by Hafiz Muhammad Shahzad, Roll No. PHCS-S19-003, Session: 2019–2023. Supervisor: Dr. Sohail Massood Bhatti. Co-Supervisor: Prof. Dr. Arfan Jaffar. (Final Thesis Defense)

Outline: 1 - Introduction, 2 - Motivation, 3 - Application, 4 - Literature Review, 5 - Problem Statement, 6 - Research Questions, 7 - Research Objectives, 8 - Dataset, 9 - Proposed Methodology, 10 - Results, 11 - Summary, 12 - Conclusion and Future Work, 13 - References

1- Introduction Facial Emotion Recognition (FER) is essential for communication, social interaction, health, and security purposes [1]. The significance of FER: facial expressions reflect personality, sentiment, goals, and intent. Facial expressions are essential for non-verbal communication, conveying emotions, and understanding others' feelings. They play a crucial role in social interactions, building connections, and expressing empathy in personal and professional settings [2].

2- Motivation Growing importance in human-computer interaction and emotion analysis. Real-world applications in virtual reality, marketing, and user interfaces. Technical challenges stimulate innovation in computer vision and machine learning. Social impact on communication for individuals with impaired social interaction.

3- Application Facial expression recognition has wide-ranging applications. Health Monitoring: FER can be used to monitor patients' pain levels, detect signs of mental health disorders, and improve overall healthcare. Human Behaviour Analysis: FER can be used to analyse human behaviour, including temperament, criminal proclivities, and deception. User Experience: FER can be used to create more engaging and immersive user experiences in gaming, entertainment, and human-computer interaction. Security: FER can help determine a person's emotional state or intent for security purposes. Cross-Cultural Communication: FER can facilitate cross-cultural communication by enabling individuals to understand the emotions of people from different cultures and backgrounds [4,5]. Fraud Detection: FER can be used to identify signs of fraud, such as micro-expressions indicating dishonesty.

4- Literature Review Mask-wearing during a pandemic makes emotion recognition challenging without full facial visibility [6]. Different facial regions are linked to specific emotions, with the mouth being important for happiness, surprise, sadness, disgust, and anger, and the eyes crucial for recognizing fear [7]. The lower part of the face has been found to be better for recognizing happy expressions, while the upper part of the face appears crucial for recognizing fearful and sad faces [8]. Emotion-recognition accuracy decreases significantly for masked faces compared to unmasked faces: overall accuracy declined from 69.9% for unmasked to 48.9% for masked target faces [9]. Facial expression recognition is hindered when facial features are concealed, with the mouth and eyes being crucial for interpretation [10].

4- Literature Review … Facial expression recognition accuracy declines significantly with occlusion or masks, but adding extra features and using multimodal approaches with CNNs can increase accuracy [11]. Multimodality combines information from multiple sensors to improve problem-solving by extracting and combining important features [12-13], resulting in enhanced performance and a more comprehensive representation in the predicted output. Multimodal fusion, or data fusion, combines information from different sources using effective methods to make accurate and reliable decisions. It is widely used in health, environment, diagnostics, and aerospace engineering [14].

5- Problem Statement Facial expression recognition has become more accurate in recent years. However, the COVID-19 pandemic has introduced a new barrier in the form of mandatory face masks. Masks have drastically impacted the recognition of facial expressions by concealing many crucial facial features; they hide up to 60% of the facial features used for expressions, which has drastically reduced FER accuracy.

6- Research Questions 1- How does the presence of masks affect the accuracy of facial expression recognition when using single-model techniques? 2- Does integrating multiple modalities, such as facial expressions and voice emotions, improve recognition accuracy for masked faces? 3- Which fusion method is most effective for integrating voice and facial expression datasets in a multimodal technique to enhance accuracy?

7- Research Objectives Investigate the M-LFW-FER masked facial expression dataset. Develop a multimodal neural network combining features from masked facial expression and voice expression datasets. Enhance the accuracy of recognizing masked facial expressions by applying different fusion techniques.

8- Datasets 1- M-LFW-FER dataset (Masked Labelled Faces in the Wild for Facial Expression Recognition). LFW-FER: the LFW dataset annotated manually for facial expression recognition. M-LFW-FER: the LFW dataset processed by an automatic mask-wearing method for masked facial expression recognition. The M-LFW-FER [16] training set contains 9825 images of faces wearing masks (5194 positive, 776 negative, 3347 neutral), and the testing set contains 1155 images (676 positive, 96 negative, 441 neutral). Figures: sample images from the M-LFW-FER and LFW-FER datasets.

8- Datasets … 2- Voice Datasets. 1- CREMA-D (Crowd-Sourced Emotional Multimodal Actors Dataset): a large-scale dataset used to train and test emotion recognition models. It contains audio recordings of people expressing six emotions: anger, disgust, fear, happiness, sadness, and neutral. CREMA-D includes 7,442 clips from 91 actors of diverse ages and ethnicities [17]. 2- TESS (Toronto Emotional Speech Set): a speech dataset consisting of 4048 audio files [18]. Seven emotions are considered for classification: happy, angry, sad, neutral, fearful, disgust, and surprised.

9- Proposed Methodology (1) First research question: Employed a unimodal architecture, extracting features from pre-trained CNNs (VGG-16 and AlexNet). Evaluated various machine learning (ML) algorithms for classification of the masked dataset. Objective: Demonstrate the benefit of replacing the default classifier on each fully connected layer of the DCNN (FC6, FC7, and FC8) with other ML classifiers (see classifier list).

9- Proposed Methodology (1) … Algorithm: feature extraction using a pre-trained CNN (AlexNet and VGG-16) followed by different classifiers.
1. Load the dataset (M-LFW-FER).
2. Load the pre-trained CNN (AlexNet or VGG-16).
3. Divide the images into training and testing sets (70% training, 30% testing).
4. Extract FC6, FC7, and FC8 features from the fully connected layer activations.
5. Train the available ML classifiers on the M-LFW-FER features from FC6, FC7, and FC8.
6. Predict the class of each masked test image using the trained classifier (positive, negative, or neutral).
7. Report the average accuracy of each classifier.
8. Summarize the findings with a confusion matrix for the highest-accuracy classifier.
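The sketch below illustrates steps 4-6 in Python with Keras and scikit-learn, assuming images are already resized to the network's input size; load_masked_dataset is a hypothetical helper, and Keras' fc1/fc2 layers stand in for the FC6/FC7 activations named in the algorithm.

```python
# Sketch: extract fully connected layer activations from a pre-trained VGG-16
# and train a replacement classifier (here an SVM) on them.
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.models import Model
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix

X, y = load_masked_dataset()            # hypothetical helper: (N, 224, 224, 3) images, (N,) labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)   # 70/30 split as in step 3

base = VGG16(weights="imagenet", include_top=True)
# "fc1" and "fc2" are Keras' names for the first two fully connected layers
# (FC6 and FC7 in the slide's numbering).
feat_model = Model(inputs=base.input, outputs=base.get_layer("fc1").output)

train_feats = feat_model.predict(preprocess_input(X_train.astype("float32")))
test_feats = feat_model.predict(preprocess_input(X_test.astype("float32")))

clf = SVC(kernel="poly", degree=2)      # quadratic SVM, one of the evaluated classifiers
clf.fit(train_feats, y_train)
pred = clf.predict(test_feats)
print("accuracy:", accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
```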

9- Proposed Methodology (1) … Transfer learning from AlexNet for feature extraction, followed by different classification techniques. Instead of training a CNN (AlexNet) from scratch for this study, a pre-trained network can be used to extract features from a variety of image types.

9- Proposed Methodology (1) … Transfer learning from VGG-16 for feature extraction, followed by different classification techniques. The pre-trained network (VGG-16) is used to extract features from the masked dataset (M-LFW-FER).

10. Results (1) Accuracy was calculated using error-correcting output codes (ECOC) [19] for multiclass classification with Support Vector Machines (SVMs). Multiple binary SVMs were used for prediction, and their outputs were combined to determine the final class label. M-LFW-FER overall accuracy: 55.0% and 56.73% with the AlexNet and VGG-16 pre-trained models. M-LFW-FER accuracy by individual fully connected layer:
Layer | AlexNet | VGG-16
FC6 | 55.00% | 56.85%
FC7 | 55.35% | 55.19%
FC8 | 55.64% | 56.73%
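A minimal sketch of the ECOC idea with scikit-learn, reusing the feature arrays from the earlier sketch; the code size and kernel are illustrative choices, not values taken from the thesis.

```python
# Sketch: error-correcting output codes (ECOC) over binary SVMs for the
# three-class problem (negative / neutral / positive).
from sklearn.multiclass import OutputCodeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

ecoc = OutputCodeClassifier(
    estimator=SVC(kernel="linear"),   # each code bit is learned by a binary SVM
    code_size=2.0,                    # codeword length = code_size * n_classes
    random_state=42)
ecoc.fit(train_feats, y_train)        # FC-layer features from the previous sketch
print("ECOC-SVM accuracy:", accuracy_score(y_test, ecoc.predict(test_feats)))
```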

10. Results (1) … Different classifiers were applied to the features extracted from FC6, FC7, and FC8 of AlexNet, with the results shown below.
Layer | Best classifier | Accuracy
FC6 | Subspace Discriminant | 62.7%
FC7 | Quadratic SVM | 61.7%
FC8 | Quadratic SVM | 63.4%
Table: Improved accuracy of the AlexNet pre-trained model on FC6, FC7, and FC8 with ML classifiers on the M-LFW-FER dataset.

10. Results (1) … Per-class results for the AlexNet features:
FC6 (accuracy 62.7): Negative P 0.14, R 0.29, F1 0.71; Neutral P 0.50, R 0.56, F1 0.44; Positive P 0.78, R 0.68, F1 0.32.
FC7 (accuracy 61.7): Negative P 0.06, R 0.40, F1 0.60; Neutral P 0.55, R 0.51, F1 0.53; Positive P 0.73, R 0.68, F1 0.71.
FC8 (accuracy 63.4): Negative P 0.05, R 0.43, F1 0.08; Neutral P 0.51, R 0.56, F1 0.53; Positive P 0.80, R 0.67, F1 0.73.

10. Results (1) … Different classifiers were applied to the features extracted from FC6, FC7, and FC8 of VGG-16, with the results shown below.
Layer | Best classifier | Accuracy
FC6 | Quadratic SVM | 65.5%
FC7 | Quadratic SVM | 62.4%
FC8 | Medium Gaussian SVM | 60.3%
Table: Improved accuracy of the VGG-16 pre-trained model on FC6, FC7, and FC8 with ML classifiers on the M-LFW-FER dataset.

10. Results (1) … Per-class results for the VGG-16 features:
FC6 (accuracy 65.5): Negative P 0.12, R 0.51, F1 0.19; Neutral P 0.57, R 0.58, F1 0.57; Positive P 0.79, R 0.70, F1 0.74.
FC7 (accuracy 62.4): Negative P 0.08, R 0.49, F1 0.14; Neutral P 0.52, R 0.54, F1 0.53; Positive P 0.77, R 0.67, F1 0.72.
FC8 (accuracy 60.3): Negative P 0.06, R 0.47, F1 0.11; Neutral P 0.47, R 0.51, F1 0.49; Positive P 0.77, R 0.65, F1 0.71.

11. Summary (1) M-LFW-FER overall accuracy: 55.0% to 56.73% with the AlexNet and VGG-16 pre-trained models. ML algorithms were applied as classifiers on each fully connected layer. SVM-based classifiers outperformed the default SoftMax classifier, improving accuracy on the M-LFW-FER dataset by 4% to 9%. The experimental results below demonstrate the effectiveness of the proposed technique.
Layer | AlexNet | Best classifier | Accuracy
FC6 | 55.00% | Subspace Discriminant | 62.7%
FC7 | 55.35% | Quadratic SVM | 61.7%
FC8 | 55.64% | Quadratic SVM | 63.4%
Layer | VGG-16 | Best classifier | Accuracy
FC6 | 56.85% | Quadratic SVM | 65.5%
FC7 | 55.19% | Quadratic SVM | 62.4%
FC8 | 56.73% | Medium Gaussian SVM | 60.3%

9- Proposed Methodology (2) Second experiment: Uses both facial and vocal expressions as modalities, trained on a combined dataset containing facial and voice data. Standard datasets used: M-LFW-FER for masked faces, and CREMA-D and TESS for voice expressions. Data heterogeneity: different formats, images for facial expressions and audio for vocal expressions; the necessary measures were taken to make the data homogeneous.

9- Proposed Methodology (2) … Pre-Processing and Data Augmentation. The TESS dataset consists of 7 categories of voice expressions: happy, angry, neutral, disgust, sad, surprise, and fear. The CREMA-D dataset consists of 6 categories: anger, disgust, fear, happy, neutral, and sad. Two categories were renamed, happy to positive and angry to negative, to make them consistent with the M-LFW-FER dataset; the neutral category remained unchanged.
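A small illustrative sketch of this relabelling step, assuming each emotion sits in its own folder of WAV files; the folder layout, paths, and helper name are assumptions, not the thesis code.

```python
# Sketch: align voice-emotion labels with the M-LFW-FER classes.
# Only three categories are kept; 'happy' maps to 'positive', 'angry' to
# 'negative', and 'neutral' stays unchanged.
import shutil
from pathlib import Path

LABEL_MAP = {"happy": "positive", "angry": "negative", "neutral": "neutral"}

def relabel(src_root: Path, dst_root: Path) -> None:
    """Copy WAV files into positive/negative/neutral folders, dropping other emotions."""
    for wav in src_root.rglob("*.wav"):
        emotion = wav.parent.name.lower()          # assumes one folder per emotion
        if emotion in LABEL_MAP:
            target = dst_root / LABEL_MAP[emotion]
            target.mkdir(parents=True, exist_ok=True)
            shutil.copy(wav, target / wav.name)

relabel(Path("TESS"), Path("TESS_relabelled"))         # hypothetical paths
relabel(Path("CREMA-D"), Path("CREMA-D_relabelled"))
```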

9- Proposed Methodology (2) … The recordings of the TESS and CREMA-D datasets are available in WAV format. The final pre-processing step for the voice data is to convert the WAV files into spectrogram images. Figures: example spectrograms from the TESS and CREMA-D datasets.
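One possible way to perform this conversion with librosa is sketched below; the mel parameters, image size, and example file name are assumptions rather than the thesis settings.

```python
# Sketch: convert a WAV recording into a mel-spectrogram image.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

def wav_to_spectrogram(wav_path: str, png_path: str) -> None:
    y, sr = librosa.load(wav_path, sr=None)                 # keep the native sample rate
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=np.max)           # log scale for visibility
    fig = plt.figure(figsize=(2.24, 2.24), dpi=100)         # roughly 224x224 px image
    librosa.display.specshow(mel_db, sr=sr)
    plt.axis("off")
    fig.savefig(png_path, bbox_inches="tight", pad_inches=0)
    plt.close(fig)

# hypothetical example file name
wav_to_spectrogram("TESS_relabelled/positive/OAF_back_happy.wav", "spec_001.png")
```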

9- Proposed Methodology (2) … Spectrogram images were generated and the important features extracted. A multimodal technique was used, combining the datasets M-LFW-FER (masked faces) with TESS (voice spectrograms), and M-LFW-FER with CREMA-D. Dataset equalization: an augmentation rule was applied to the TESS dataset so that the number of voice observations matches the number of images in the M-LFW-FER dataset.
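A sketch of one way to implement this equalization by oversampling spectrograms with light augmentation; the target count and transforms are illustrative, and in practice the rule would be applied per emotion class so labels stay aligned.

```python
# Sketch: oversample the smaller spectrogram set until it matches the number of
# masked-face images, applying small random perturbations to the repeats.
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(width_shift_range=0.05,
                               height_shift_range=0.05,
                               zoom_range=0.05)

def equalize(spectrograms: np.ndarray, target_count: int) -> np.ndarray:
    """Repeat and lightly perturb spectrogram images until target_count is reached."""
    out = list(spectrograms)
    i = 0
    while len(out) < target_count:
        img = spectrograms[i % len(spectrograms)]
        out.append(augmenter.random_transform(img))
        i += 1
    return np.stack(out[:target_count])

# e.g. grow one class of voice spectrograms to the corresponding M-LFW-FER class size:
# voice_positive = equalize(voice_positive, target_count=5194)
```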

9- Proposed Methodology (2) … The proposed model consists of two types of layers: convolutional layers followed by fully connected layers, for both the visual and the audio expression inputs. The entire architecture is depicted in the figure. Figure: Flowchart of the proposed multimodal method.
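A minimal Keras sketch of this two-branch idea, with one convolutional branch per modality feeding shared fully connected layers; layer sizes and depths are illustrative and not the thesis configuration.

```python
# Sketch: two convolutional branches (masked face image, voice spectrogram image)
# whose features are concatenated and classified by shared dense layers.
from tensorflow.keras import layers, models

def conv_branch(name: str) -> models.Model:
    inp = layers.Input(shape=(224, 224, 3), name=f"{name}_input")
    x = layers.Conv2D(32, 3, activation="relu")(inp)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.GlobalAveragePooling2D()(x)
    return models.Model(inp, x, name=f"{name}_branch")

face = conv_branch("face")          # masked facial expression image
voice = conv_branch("voice")        # voice spectrogram image

merged = layers.concatenate([face.output, voice.output])
x = layers.Dense(256, activation="relu")(merged)
x = layers.Dropout(0.5)(x)
out = layers.Dense(3, activation="softmax")(x)   # positive / negative / neutral

model = models.Model([face.input, voice.input], out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```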

9- Proposed Methodology (2) … Figure: Multimodal architecture for M-LFW-FER & CREMA-D.

10- Results (2) The M-LFW-FER dataset was examined on three architectures: VGG-16, ResNet-50, and EfficientNetV2M. Testing accuracy achieved: VGG-16 55.6%, ResNet-50 56.8%, and EfficientNetV2M 49%. Testing accuracy on the masked dataset was poor because the mask covers most of the facial features used for emotions, causing a performance drop for CNNs and other machine learning methods.
Model | Accuracy
VGG-16 | 55.6
ResNet-50 | 56.8
EfficientNetV2M | 49
Table: Accuracies on the M-LFW-FER dataset.

10- Results (2) … For the CREMA-D and TESS datasets, other authors have reported neural-network results of 55.01% testing accuracy for CREMA-D [20] and 97.15% for TESS [21], as shown below.
Model | Dataset | Accuracy
VGG-16 | TESS | 97.15
CNN | CREMA-D | 55.01
Table: Accuracies on the TESS and CREMA-D datasets.

10- Results (2) … The proposed model was evaluated with the multimodal technique in different experiments. First experiment: M-LFW-FER with the TESS dataset, achieving 99.92% accuracy. Second experiment: M-LFW-FER with the CREMA-D dataset, achieving 75.67% accuracy. Multimodal accuracy is higher than the single-model technique for the masked dataset.
M-LFW-FER & TESS (99.92): Negative P 1.00, R 1.00, F1 1.00; Neutral P 1.00, R 1.00, F1 1.00; Positive P 1.00, R 0.98, F1 0.99.
M-LFW-FER & CREMA-D (75.67): Negative P 0.816, R 0.75, F1 0.78; Neutral P 0.70, R 0.75, F1 0.73; Positive P 0.63, R 0.77, F1 0.70.
Dataset | Accuracy
M-LFW-FER & TESS | 99.92%
M-LFW-FER & CREMA-D | 75.67%

10- Results (2) …
Dataset | Accuracy
M-LFW-FER & TESS | 99.92%
M-LFW-FER & CREMA-D | 75.67%
Table: Performance evaluation of the multimodal CNN (proposed method).
Comparison of different experiments with the multimodal technique: multimodal experiments were conducted with the CREMA-D and M-LFW-FER datasets using different techniques. First experiment: without regularization, accuracy 68.48%. Second experiment: with regularization, accuracy 71.34%. Third experiment: with dropout, accuracy 71.6%. Fourth experiment: with dropout and regularization, accuracy 75.67%.

11- Summary (2) A multimodal technique was proposed to improve accuracy on the challenging M-LFW-FER dataset. The TESS and CREMA-D voice emotion datasets were evaluated alongside the masked facial expressions, used in parallel with the M-LFW-FER faces dataset in the proposed multimodal architecture. Achieved 75.67% accuracy on the M-LFW-FER and CREMA-D datasets and 99.92% on the M-LFW-FER and TESS datasets.

9- Proposed Methodology (3) Evaluations of fusion techniques within the modified Xception architecture for the third research question. First technique: data-level fusion, concatenating the data before passing it to the modified Xception architecture. Second technique: feature-level fusion, concatenating the features within the convolutional part of the modified architecture before passing them to the fully connected network. Third technique: late fusion, passing the two datasets individually through the modified architecture and concatenating them in the final neural network layer. Additionally, an experiment was performed using the unimodal technique on the masked dataset with the modified Xception architecture.
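The sketch below shows where the three fusion points sit, using a small stand-in backbone in place of the modified Xception; every layer choice here is illustrative.

```python
# Sketch: data-level, feature-level, and decision-level fusion of a face image
# and a voice spectrogram image.
from tensorflow.keras import layers, models

def backbone(inp):
    # stand-in for the modified Xception convolutional stack
    x = layers.Conv2D(32, 3, activation="relu")(inp)
    return layers.GlobalAveragePooling2D()(x)

face_in = layers.Input((224, 224, 3), name="face")
voice_in = layers.Input((224, 224, 3), name="voice")

# 1. Data-level (early) fusion: concatenate the raw inputs along the channel
#    axis, then pass the single 6-channel tensor through one backbone.
early = backbone(layers.Concatenate(axis=-1)([face_in, voice_in]))
early_out = layers.Dense(3, activation="softmax")(layers.Dense(128, activation="relu")(early))

# 2. Feature-level (middle) fusion: separate backbones, concatenate their
#    convolutional features before the shared fully connected layers.
middle = layers.concatenate([backbone(face_in), backbone(voice_in)])
middle_out = layers.Dense(3, activation="softmax")(layers.Dense(128, activation="relu")(middle))

# 3. Decision-level (late) fusion: each modality keeps its own dense head and the
#    two are joined only at the final layer.
face_head = layers.Dense(64, activation="relu")(backbone(face_in))
voice_head = layers.Dense(64, activation="relu")(backbone(voice_in))
late_out = layers.Dense(3, activation="softmax")(layers.concatenate([face_head, voice_head]))

early_model = models.Model([face_in, voice_in], early_out)
middle_model = models.Model([face_in, voice_in], middle_out)
late_model = models.Model([face_in, voice_in], late_out)
```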

9- Proposed Methodology (3) … Pre-Processing and Data Augmentation. CREMA-D dataset: 6 expressions (anger, disgust, fear, happy, neutral, and sad). M-LFW-FER dataset: 3 expressions (positive, neutral, and negative). Three voice expression types were selected for consistency: happy, angry, and neutral. Categories in the CREMA-D dataset were renamed to match the M-LFW-FER names, happy to positive and angry to negative; the neutral category remains unchanged. Recordings in the CREMA-D dataset are available in WAV format. Preprocessing step for the voice dataset: convert the WAV files into spectrogram images. Data augmentation: the total number of observations in the masked dataset must equal the total number of observations in the voice dataset.

9- Proposed Methodology (3) … The recordings of the CREMA-D dataset are available in WAV format. The final pre-processing step for the voice dataset is to convert the WAV files into spectrogram images, as shown in the figure. Figure: example spectrograms from the CREMA-D dataset.

9- Proposed Methodology (3) … Single-Model Approach on the Proposed Architecture. The proposed unimodal architecture uses regularization techniques such as dropout and ridge (L2) regression to prevent overfitting and improve generalization, as shown in the figure. The M-LFW-FER and CREMA-D datasets were passed separately into the architecture, and results were obtained for each dataset. Figure: Flowchart of the proposed single-model method.
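A hedged Keras sketch of this unimodal set-up: an ImageNet-pretrained Xception backbone with dropout and L2 (ridge) penalties on the classification head; penalty strengths and head sizes are assumptions, not the thesis values.

```python
# Sketch: Xception backbone with a regularized classification head.
from tensorflow.keras import layers, models, regularizers
from tensorflow.keras.applications import Xception

base = Xception(weights="imagenet", include_top=False, input_shape=(299, 299, 3))

x = layers.GlobalAveragePooling2D()(base.output)
x = layers.Dense(256, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4))(x)   # ridge (L2) penalty
x = layers.Dropout(0.5)(x)                                       # dropout regularization
out = layers.Dense(3, activation="softmax",
                   kernel_regularizer=regularizers.l2(1e-4))(x)

model = models.Model(base.input, out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# The same network is trained separately on the M-LFW-FER images and the
# CREMA-D spectrograms to obtain the two unimodal baselines.
```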

9- Proposed Methodology (3) … Fusion Techniques of the Multimodal Approach on the Proposed Architecture. 1- Data-Level Fusion (Early Fusion). Figure: Flowchart of the proposed multimodal method (data-level fusion).

9- Proposed Methodology (3) … Fusion Techniques of the Multimodal Approach on the Proposed Architecture. 2- Feature-Level Fusion (Middle Fusion). Figure: Flowchart of the proposed multimodal method (feature-level fusion).

9- Proposed Methodology (3) … Fusion Techniques of the Multimodal Approach on the Proposed Architecture. 3- Decision-Level Fusion (Late Fusion). Figure: Flowchart of the proposed multimodal method (decision-level fusion).

10. Results (3) The proposed architecture performs remarkably well, surpassing the other techniques significantly. Unimodal accuracy: 68.76% on M-LFW-FER and 72.13% on CREMA-D. It stands out in both cases and achieved the highest accuracy on the CREMA-D dataset, showing its superiority over existing methods (tables below).
Model | Accuracy
VGG-19 [76] | 53.56
ResNet-50 | 56.80
MobileNet [76] | 66.41
Xception (Proposed) | 68.76
Table 5.2: Accuracies for the unimodal approach on the M-LFW-FER dataset.
Model | Accuracy
ResNet-50 | 53.08
CNN [469] | 55.01
VGG-19 | 60.17
Xception (Proposed) | 72.13
Table 5.3: Accuracies for the unimodal approach on the CREMA-D dataset.

10. Results (3) … Multimodal architecture experiments were conducted for masked facial expression recognition, achieving 75.93% accuracy with early fusion, 79.05% with middle fusion (outperforming the others), and 72.99% with late fusion. Table 5.4 shows the detailed results.
Model | Accuracy
Decision-Level Fusion | 72.99
Data-Level Fusion | 75.93
Feature-Level Fusion | 79.05
Table 5.4: Performance evaluation of the multimodal CNN (proposed method).

10. Results (3) … Per-class results for the three fusion techniques:
Data-level fusion (75.93): Negative P 0.79, R 0.79, F1 0.79; Neutral P 0.73, R 0.68, F1 0.70; Positive P 0.67, R 0.86, F1 0.75.
Feature-level fusion (79.05): Negative P 0.78, R 0.87, F1 0.82; Neutral P 0.77, R 0.66, F1 0.71; Positive P 0.89, R 0.81, F1 0.85.
Decision-level fusion (72.99): Negative P 0.80, R 0.70, F1 0.75; Neutral P 0.67, R 0.74, F1 0.70; Positive P 0.65, R 0.86, F1 0.74.

10. Results (3) … Feature-level fusion improved multimodal accuracy by 4% to 6% compared to the other fusion techniques. The multimodal approach enhanced overall test accuracy from 70% to 79%, a 9% increase on the M-LFW-FER and CREMA-D datasets. An earlier study by the author [22] achieved 75.67% accuracy using a multimodal approach. Recent research has shown interest in exploring the impact of masks on facial and voice expressions.
Architecture | Type | Dataset | Accuracy
VGG-16 [22] | Unimodal | M-LFW-FER | 55.60
ResNet-50 [22] | Unimodal | M-LFW-FER | 56.80
VGG-19 [16] | Unimodal | M-LFW-FER | 60.17
Xception (Proposed) | Unimodal | M-LFW-FER | 68.76
MobileNet [16] | Unimodal | M-LFW-FER | 69.71
Decision Fusion (Proposed) | Multimodal | M-LFW-FER & CREMA-D | 72.99
MMAFER [22] | Multimodal | M-LFW-FER & CREMA-D | 75.67
Data Fusion (Proposed) | Multimodal | M-LFW-FER & CREMA-D | 75.93
Feature Fusion (Proposed) | Multimodal | M-LFW-FER & CREMA-D | 79.05
Table 5.5: Performance evaluation of the fusion methods (proposed method).

11. Summary (3) Various fusion techniques and single-model techniques were utilized for enhancement. The modified Xception architecture uses dropout and regularization to mitigate overfitting. The unimodal model achieved 68.76% accuracy on the masked dataset, while the multimodal model achieved 79.05%, outperforming the unimodal model.

12- Conclusion and Future Work Conclusion: Facial expression recognition is a challenging task due to factors such as facial hair, glasses, or masks that can obscure expressions. Single-model techniques like AlexNet and VGG-16 have limited accuracy in identifying masked faces, ranging from 55% to 57%. Extracting features from the hidden layers of these models and applying various classifiers can enhance the accuracy of facial expression recognition systems. Multimodal techniques that combine facial expressions and voice emotions can improve accuracy further, reaching 75.67%.

12- Conclusion and Future Work … The feature-fusion technique in the multimodal architecture is the most effective, yielding an accuracy of 79.05%. Multimodal techniques offer new opportunities for future research and development in facial expression recognition; they can help overcome the limitations of individual modalities and lead to improved accuracy. They also have the potential to enhance emotional communication, particularly in situations where facial features are obscured, such as when individuals are wearing masks.

12- Conclusion and Future Work Future Work: In addition to facial and voice expressions, other modalities could be used to improve recognition accuracy. Body language, eye movement, and other physiological signals are potential modalities for multimodal recognition. Combining multiple modalities gives a more complete picture of the emotional state, leading to improved accuracy.

13- References [1] P. S. Sreeja and G. Mahalakshmi, "Emotion models: a review," International Journal of Control Theory and Applications, vol. 10, no. 8, pp. 651-657, 2017. [2] M. Leo, P. Carcagni, P. L. Mazzeo, et al., "Analysis of facial information for healthcare applications: A survey on computer vision-based approaches," Information, vol. 11, no. 3, pp. 128, 2020. [3] B. C. Ko, "A brief review of facial emotion recognition based on visual information," Sensors, vol. 18, pp. 401, 2018. [4] L. F. Barrett, R. Adolphs, S. Marsella, A. M. Martinez and S. D. Pollak, "Emotional expressions reconsidered: challenges to inferring emotion from human facial movements," Psychological Science in the Public Interest, vol. 20, no. 1, pp. 1-68, 2019. [5] M. Sajjad, M. Nasir, K. Muhammad, S. Khan, Z. Jan, et al., "Raspberry Pi assisted face recognition framework for enhanced law-enforcement services in smart cities," Future Generation Computer Systems, vol. 108, pp. 995-1007, 2020.

13- References… [6] M. Marini, A. Ansani, F. Paglieri, F. Caruana and M. Viola, "The impact of facemasks on emotion recognition, trust attribution and re-identification," Scientific Reports, vol. 11, pp. 1-14, 2021. [7] Y. Kong, Z. Ren, K. Zhang, S. Zhang, Q. Ni and J. Han, "Lightweight facial expression recognition method based on attention mechanism and key region fusion," Journal of Electronic Imaging, vol. 30, no. 6, 2021. [8] V. Franzoni, G. Biondi and A. Milani, "Emotional sounds of crowds: spectrogram-based analysis using deep learning," Multimedia Tools and Applications, vol. 79, pp. 36063-36075, 2020. [9] F. Grundmann, K. Epstude and S. Scheibe, "Face masks reduce emotion-recognition accuracy and perceived closeness," PLOS ONE, vol. 16, no. 4, 2021. [10] M. N. A. Tawhid, S. Siuly, H. Wang, F. Whittaker, K. Wang and Y. Zhang, "A spectrogram image based intelligent technique for automatic detection of autism spectrum disorder from EEG," PLOS ONE, vol. 16, pp. 1-20, 2021.

13- References… [11] A. S. Al-Waisy, R. Qahwaji, S. Ipson and S. Al-Fahdawi, "A multimodal deep learning framework using local feature representations for face recognition," Machine Vision and Applications, vol. 29, pp. 35-54, 2018. [12] S. Vachmanus, A. A. Ravankar, T. Emaru and Y. Kobayashi, "Multi-modal sensor fusion-based semantic segmentation for snow driving scenarios," IEEE Sensors Journal, vol. 21, no. 15, pp. 16839-16851, 2021. [13] Q. Abbas, M. E. Ibrahim and M. A. Jaffar, "A comprehensive review of recent advances on deep vision systems," Artificial Intelligence Review, vol. 52, no. 1, pp. 39-76, 2019. [14] K. Wang, Y. Song, Z. Huang, Y. Sun, J. Xu and S. Zhang, "Additive manufacturing energy consumption measurement and prediction in fabricating lattice structure based on recallable multimodal fusion network," Measurement, vol. 196, no. 15, pp. 111215, 2022. [15] W. Sun, X. Chen, X. R. Zhang, G. Z. Dai, P. S. Chang et al., "A multi-feature learning model with enhanced local attention for vehicle re-identification," Computers, Materials & Continua, vol. 69, no. 3, pp. 3549-3560, 2021.

13- References… [16] B. Yang, J. Wu and G. Hattori, "Facial expression recognition with the advent of human beings all behind face masks," Association for Computing Machinery, 2020. [17] R. Pappagari, T. Wang, J. Villalba, N. Chen and N. Dehak, "X-Vectors meet emotions: A study on dependencies between emotion and speaker recognition," in 45th Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, Spain, pp. 7169-7173, 2020. [18] M. K. Pichora-Fuller and K. Dupuis, "Toronto emotional speech set (TESS)," Scholars Portal Dataverse, vol. 1, 2020. [19] J. Y. Zou, M. X. Sun, K. H. Liu and Q. Q. Wu, "The design of dynamic ensemble selection strategy for the error-correcting output codes family," Information Sciences, vol. 571, pp. 1-23, 2021.

13- References… [20] A. Aggarwal, A. Srivastava, A. Agarwal, N. Chahal et al., "Two-way feature extraction for speech emotion recognition using deep learning," Sensors, vol. 22, no. 6, pp. 2378, 2022. [21] A. Shukla, K. Vougioukas, P. Ma, S. Petridis and M. Pantic, "Visually guided self-supervised learning of speech representations," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, pp. 6299-6303, 2020. [22] H. M. Shahzad, S. M. Bhatti, A. Jaffar and M. Rashid, "A multi-modal deep learning approach for emotion recognition," Intelligent Automation & Soft Computing, vol. 36, 2023.