Deepfake Detection with the help of AI.pptx

rudracool62 251 views 20 slides Jul 23, 2024

About This Presentation

Deepfake detection: to design and develop a deep learning algorithm that classifies a video as deepfake or pristine.


Slide Content

By Mamta Singh, Priyanka Kumari, Rohit Jaiswal, and Naimish Kumar Verma

Problem Statement: To design and develop a deep learning algorithm that classifies a video as deepfake or pristine.

Introduction: What is a deepfake? A deepfake is a technique that uses deep learning algorithms to create fake media, usually by swapping a person's face or audio from a source onto another person in a target video; our goal is to detect such fake videos. These hyper-realistic digital manipulations of images, voices, and videos have the potential to create widespread misinformation, erode trust in media, and even incite political and social unrest. Deepfakes can make people appear to say things they never said, do things they never did, and place them in scenarios that never occurred.

The objective of fake face detection is to identify and distinguish manipulated or synthetic faces, such as those created through deepfake technology, from authentic human faces in images or videos; this is crucial for maintaining the integrity of visual content and preventing the spread of misinformation and deceptive practices. The objective of fake video detection is to employ advanced technology, such as deep learning algorithms and forensic analysis, to identify manipulated or fabricated content in videos. The objective of fake voice detection is to identify and distinguish synthetic or manipulated voices from genuine human voices, which is crucial for voice authentication, fraud prevention, and trustworthy communication platforms; techniques typically analyze acoustic and linguistic features to detect anomalies indicative of artificially generated or altered voices.

How to Detect Fake Videos and Faces

- Check the source: verify the origin of the video; content from a reputable source is more likely to be authentic.
- Look for inconsistencies: watch for inconsistencies in lighting, shadows, and reflections that may indicate manipulation.
- Audio analysis: pay attention to audio quality and consistency; mismatched or unnatural sounds can be a sign of manipulation.
- Frame analysis: examine individual frames for anomalies, such as strange artifacts or inconsistencies introduced during editing.
- Deepfake detection tools: use specialized tools that apply AI algorithms to spot unnatural facial expressions or inconsistencies.
- Reverse image/video search: check whether the content has appeared elsewhere, which can indicate manipulation.
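Frame analysis can be partly automated: abrupt jumps between consecutive frames sometimes betray splices or edits. A minimal sketch in plain NumPy, with an illustrative z-score threshold (not any particular tool's actual method):

```python
import numpy as np

def flag_abrupt_frames(frames, z_thresh=3.0):
    # Mean absolute difference between consecutive frames; flag outlier jumps.
    diffs = np.array([np.abs(b - a).mean() for a, b in zip(frames, frames[1:])])
    mu, sigma = diffs.mean(), diffs.std() + 1e-8
    return [i + 1 for i, d in enumerate(diffs) if (d - mu) / sigma > z_thresh]

# Toy "video": slowly brightening frames with one spliced-in outlier frame.
frames = [np.full((4, 4), 0.01 * i) for i in range(30)]
frames[15] = frames[15] + 5.0
flags = flag_abrupt_frames(frames)  # catches the jump into and out of frame 15
```

Real editing artifacts are subtler than this toy splice, but the same idea (statistics over per-frame differences) underlies many practical frame-level checks.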

System Architecture

Data-set Exploration

Pre-processing
1. Split the video into frames
2. Face detection
3. Crop the face
4. Create a new face-cropped video
5. Save the face-cropped video
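The five pre-processing steps above can be sketched as a small pipeline. This is a minimal illustration with a hypothetical `detect_face` stub standing in for a real detector (e.g. a Haar cascade or MTCNN), not the authors' actual code:

```python
import numpy as np

def detect_face(frame):
    # Hypothetical stub: a real pipeline would run a face detector here
    # (e.g. a Haar cascade or MTCNN) and return the bounding box.
    return (8, 8, 16, 16)  # (x, y, w, h)

def preprocess(video):
    # 1. the video is already split into frames (the first axis)
    cropped = []
    for frame in video:
        x, y, w, h = detect_face(frame)          # 2. face detection
        cropped.append(frame[y:y + h, x:x + w])  # 3. crop the face region
    return np.stack(cropped)                     # 4-5. the new face-cropped video

video = np.zeros((10, 32, 32))  # 10 dummy grayscale frames
faces = preprocess(video)
```

In practice the cropped frames would be re-encoded and saved to disk; stacking them into one array stands in for that step here.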

Model Architecture: a ResNeXt-50 CNN feature extractor, followed by one LSTM layer with a 2048-dimensional input vector and 2048 latent features, a dropout probability of 0.4, a ReLU activation function, and a final sequential (fully connected) layer.
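A rough sketch of the described classification head, with tiny stand-in dimensions (the actual model uses 2048-dimensional ResNeXt-50 features); the LSTM cell, ReLU, dropout, and softmax are written out in plain NumPy purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
D = H = 8  # stand-in sizes; the real model uses 2048-d features

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    # One step of the single LSTM layer over per-frame CNN features.
    z = W @ x + U @ h + b                         # (4H,) stacked gate pre-activations
    i, f = sigmoid(z[:H]), sigmoid(z[H:2 * H])    # input / forget gates
    g, o = np.tanh(z[2 * H:3 * H]), sigmoid(z[3 * H:])  # candidate cell / output gate
    c = f * c + i * g
    return o * np.tanh(c), c

def classify(seq, W, U, b, Wo, bo, p_drop=0.4, train=False):
    # LSTM over the frame sequence, then ReLU, dropout (0.4), linear, softmax.
    h = c = np.zeros(H)
    for x in seq:
        h, c = lstm_step(x, h, c, W, U, b)
    z = np.maximum(h, 0.0)                        # ReLU
    if train:
        z = z * (rng.random(H) > p_drop) / (1.0 - p_drop)  # inverted dropout
    logits = Wo @ z + bo
    e = np.exp(logits - logits.max())
    return e / e.sum()                            # softmax over the two classes

W = 0.1 * rng.standard_normal((4 * H, D))
U = 0.1 * rng.standard_normal((4 * H, H))
b = np.zeros(4 * H)
Wo = 0.1 * rng.standard_normal((2, H))
bo = np.zeros(2)
probs = classify(rng.standard_normal((5, D)), W, U, b, Wo, bo)
```

Only the last hidden state feeds the classifier here; variants that pool over all time steps are also common.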

Training Workflow

Prediction Workflow

RESULTS

Model Name                          Dataset          Videos  Seq. Length  Accuracy (%)
model_90_acc_20_frames_FF_data      FaceForensics++  2000    20           90.95
model_95_acc_40_frames_FF_data      FaceForensics++  2000    40           95.23
model_97_acc_60_frames_FF_data      FaceForensics++  2000    60           97.49
model_97_acc_80_frames_FF_data      FaceForensics++  2000    80           97.73
model_97_acc_100_frames_FF_data     FaceForensics++  2000    100          97.76
model_84_acc_10_frames_final_data   Our Dataset      6000    10           84.66
model_87_acc_20_frames_final_data   Our Dataset      6000    20           87.79
model_89_acc_40_frames_final_data   Our Dataset      6000    40           89.35
model_91_acc_60_frames_final_data   Our Dataset      6000    60           91.59
model_92_acc_80_frames_final_data   Our Dataset      6000    80           92.50
model_93_acc_100_frames_final_data  Our Dataset      6000    100          92.11

What is a deepfake voice? A voice deepfake closely mimics a real person's voice, accurately replicating the tonality, accent, cadence, and other unique characteristics of the target person. Such voice clones or synthetic voices are generated with AI and substantial computing power.

How to detect a fake voice: early audio deepfake detection relied mainly on hidden Markov models and Gaussian mixture models, and later evolved into front-end/back-end architectures. A typical audio deepfake detection system is a framework composed of a front end and a back end: the front end extracts acoustic features from speech, and the back end converts those features into scores. Traditional front-end feature extractors use digital signal processing algorithms to extract spectral, phase, or other acoustic features.
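The traditional DSP front end can be illustrated with a minimal feature extractor: frame the waveform, apply a window, take the FFT, and keep the log-magnitude spectrum. A plain-NumPy sketch (frame length and hop are illustrative choices):

```python
import numpy as np

def log_spectrum(signal, frame_len=256, hop=128):
    # Frame the waveform, apply a Hann window, FFT, keep log magnitude.
    window = np.hanning(frame_len)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        feats.append(np.log(np.abs(np.fft.rfft(frame)) + 1e-10))
    return np.array(feats)  # shape: (num_frames, frame_len // 2 + 1)

sr = 16000
wave = np.sin(2 * np.pi * 1000 * np.arange(4096) / sr)  # 1 kHz tone
feats = log_spectrum(wave)  # 1 kHz lands in FFT bin 16 (16000/256 = 62.5 Hz/bin)
```

A back end would then map such per-frame feature vectors to a real-vs-fake score; modern systems replace this hand-crafted front end with learned or self-supervised features.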

The whole model consists of a pre-trained HuBERT-based front end and a back-end detection model. The input of the entire model is the original waveform, and the output is a binary classification result. First, the data were pre-processed by adding impulse signals and additive white noise to the original audio for data augmentation (see Section 3.1 for details). Next, a self-supervised pre-trained model with fine-tuning (see Section 3.2 for more information) was used to extract acoustic features. A fully connected layer was added after the self-supervised front end to train jointly with the back-end detection model and to reduce the dimensionality of the self-supervised model's output. The extracted acoustic features were then processed by the three residual blocks of the back-end detection model (see Section 3.3 for details), where α-FMS was used to obtain more discriminative features. Finally, a softmax activation function in the output layer produced the real-or-fake detection result.
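The white-noise augmentation step can be sketched as scaling noise to a target signal-to-noise ratio before adding it to the waveform. A minimal illustration (the paper's exact augmentation settings may differ):

```python
import numpy as np

def add_white_noise(wave, snr_db, rng=None):
    # Scale white Gaussian noise so that signal power / noise power equals
    # the target SNR (in dB), then add it to the waveform.
    rng = np.random.default_rng(0) if rng is None else rng
    noise = rng.standard_normal(len(wave))
    sig_power = np.mean(wave ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10.0)))
    return wave + scale * noise

wave = np.sin(2 * np.pi * 440 * np.arange(8000) / 16000)  # 440 Hz tone
noisy = add_white_noise(wave, snr_db=10.0)
```

Randomizing the SNR per utterance is a common refinement, so the model sees a range of noise conditions during training.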

Fine-tuning is a transfer learning method suited to smaller datasets, with low training cost, that can improve detection performance on known attacks. Some studies have shown that fine-tuning is beneficial and can prevent overfitting, promoting better generalization [14]. Pre-training extracts features only from natural speech; fine-tuning on both natural and deepfake audio enables the self-supervised pre-trained model to adapt to the downstream task of audio deepfake detection, which helps to improve detection performance. The fine-tuning process is shown in Figure 2b: after pre-training on unlabeled data, fine-tuning was performed on the two labeled training sets. The back-end detection model and the pre-trained HuBERT model were jointly optimized by back-propagation, and a weighted cross-entropy loss function was used to calculate the loss.
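The weighted cross-entropy loss mentioned above can be written out directly. This is a generic NumPy formulation; the class weights shown are illustrative, not the paper's values:

```python
import numpy as np

def weighted_ce(logits, labels, weights):
    # Per-class weights counter class imbalance (e.g. far more bona fide
    # than spoofed utterances); the sum is normalized by the total weight.
    z = logits - logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    w = weights[labels]
    return -(w * log_probs[np.arange(len(labels)), labels]).sum() / w.sum()

logits = np.zeros((4, 2))                 # uniform predictions
labels = np.array([0, 1, 1, 0])
loss = weighted_ce(logits, labels, np.array([1.0, 9.0]))  # = ln 2 here
```

With uniform predictions every example contributes ln 2 regardless of its weight, which is a quick sanity check for the implementation.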

The original RawNet2 model cannot fully extract the deeper features of fake audio, cannot effectively distinguish the key features of real and deepfake speech, and its generalizability needs improvement. This study therefore made the following improvements to RawNet2: (1) a self-supervised speech pre-training model replaced the sinc convolutional layers; (2) the residual structure was improved by using α-FMS instead of FMS; (3) the number of residual blocks was reduced. Most end-to-end speaker recognition models show degraded performance compared with models using hand-crafted features, while the widely adopted ECAPA-TDNN model and its variants [22, 23] achieve an EER below 1%. Following the ECAPA-TDNN setting, the number of residual blocks was reduced from 6 to 3 to speed up training and make the model more efficient. The structure of the improved model is shown in Figure 3a, and the structure of the improved residual block in Figure 3b.
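α-FMS can be sketched in one common formulation: derive a per-channel sigmoid scale from the time-averaged feature map, then apply it as (feature + α) · scale, where α is a learnable per-channel parameter. A plain-NumPy illustration (the paper's exact variant may differ):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def alpha_fms(feat, W, b, alpha):
    # feat: (C, T) feature map from a residual block.
    # W (C, C), b (C,), alpha (C,) are assumed learnable parameters.
    pooled = feat.mean(axis=1)        # average-pool over time -> (C,)
    s = sigmoid(W @ pooled + b)       # per-channel scale in (0, 1)
    return (feat + alpha[:, None]) * s[:, None]

feat = np.arange(32.0).reshape(4, 8)
# With zero weights/alpha and a large bias, the scale saturates near 1,
# so the block passes features through unchanged.
out = alpha_fms(feat, W=np.zeros((4, 4)), b=np.full(4, 50.0), alpha=np.zeros(4))
```

Compared with plain FMS (multiplicative scaling only), the additive α term lets each channel shift its features before scaling, which is the claimed source of the more discriminative features.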