Deepfake Detection: Design and Develop a Deep Learning Algorithm to Classify a Video as Deepfake or Pristine
Size: 11.32 MB
Language: en
Added: Jul 23, 2024
Slides: 20 pages
By Mamta Singh, Priyanka Kumari, Rohit Jaiswal, Naimish Kumar Verma
Problem Statement: To design and develop a deep learning algorithm that classifies a video as deepfake or pristine.
Introduction What is a Deepfake? A deepfake is media created with deep learning algorithms, usually by swapping a person's face or audio from a source onto another person in a target video; our goal is to detect such fake videos. These hyper-realistic digital manipulations of images, voices, and videos have the potential to create widespread misinformation, erode trust in media, and even incite political and social unrest. Deepfakes can make people appear to say things they never said, do things they never did, and create scenarios that never occurred.
Objectives
Fake face detection: identify and distinguish manipulated or synthetic faces, such as those created through deepfake technology, from authentic human faces in images or videos. This is crucial for maintaining the integrity of visual content and preventing the spread of misinformation and deceptive practices.
Fake video detection: employ advanced technology, such as deep learning algorithms and forensic analysis, to identify manipulated or fabricated content in videos, helping to maintain the integrity of visual information and combat the spread of misinformation.
Fake voice detection: identify and distinguish synthetic or manipulated voices from genuine human voices. This is crucial for voice-based systems such as voice authentication, fraud prevention, and trustworthy communication platforms. Techniques typically analyze acoustic and linguistic features to detect anomalies indicative of artificially generated or altered voices.
How to Detect Fake Videos and Faces
Check the source: verify the origin of the video; content from a reputable source is more likely to be authentic.
Look for inconsistencies: watch for inconsistencies in lighting, shadows, and reflections that may indicate manipulation.
Audio analysis: pay attention to audio quality and consistency; mismatched or unnatural sounds can be a sign of manipulation.
Frame analysis: analyze individual frames for anomalies, such as strange artifacts or inconsistencies introduced during video editing.
Deepfake detection tools: use specialized tools, often based on AI algorithms, that identify unnatural facial expressions or inconsistencies.
Reverse image/video search: check whether the content has appeared elsewhere, which can reveal potential manipulation.
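The frame-analysis step above can be illustrated with a small sketch: score each frame-to-frame transition by mean pixel change and flag transitions that deviate strongly from the clip's average. The function name, the z-score heuristic, and the threshold are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def frame_anomaly_scores(frames, threshold=3.0):
    """Score each frame transition by mean absolute pixel change and
    flag transitions whose change deviates strongly from the average.
    `frames` is a sequence of grayscale frames (H x W float arrays)."""
    diffs = np.array([np.abs(b - a).mean() for a, b in zip(frames, frames[1:])])
    mu, sigma = diffs.mean(), diffs.std() + 1e-8
    z = (diffs - mu) / sigma            # standardised change per transition
    return diffs, z > threshold         # boolean mask of suspicious transitions
```

A real detector would look for compression artifacts and blending boundaries rather than raw pixel jumps, but abrupt statistical outliers between frames are one simple cue for edited video.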
System Architecture
Data-set Exploration
Pre-processing
1. Split video into frames
2. Face detection
3. Cropping the face
4. Creating a new face-cropped video
5. Saving the face-cropped video
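The pre-processing steps above can be sketched as follows. The face detector is stubbed with a fixed centre box (a real pipeline would use a detector such as MTCNN or OpenCV Haar cascades), and the function names and 112-pixel output size are illustrative assumptions, not the project's actual code.

```python
import numpy as np

def detect_face(frame):
    """Stub detector: returns a fixed (x, y, w, h) box covering the
    centre of the frame. A real pipeline would run a face detector."""
    h, w = frame.shape[:2]
    return (w // 4, h // 4, w // 2, h // 2)

def preprocess_video(frames, size=112):
    """Detect, crop, and resize the face in every frame, then stack
    the results into a face-cropped video array."""
    cropped = []
    for frame in frames:                      # iterate over split frames
        x, y, w, h = detect_face(frame)       # face detection
        face = frame[y:y + h, x:x + w]        # crop the face region
        # nearest-neighbour resize to a fixed size (dependency-free)
        ys = np.linspace(0, face.shape[0] - 1, size).astype(int)
        xs = np.linspace(0, face.shape[1] - 1, size).astype(int)
        cropped.append(face[np.ix_(ys, xs)])
    return np.stack(cropped)                  # the new face-cropped video
```

Cropping to the face before training focuses the model on the manipulated region and sharply reduces the input size per frame.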
Model Architecture
ResNeXt-50 backbone for per-frame feature extraction
1 LSTM layer with a 2048-dimensional input vector and 2048 latent features, with a dropout probability of 0.4 and ReLU activation
Sequential layer
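A minimal PyTorch sketch of this architecture follows. To keep it self-contained, the ResNeXt-50 backbone is stubbed with a linear layer; in practice it would be torchvision's `resnext50_32x4d` truncated before its final fc layer (which yields 2048-d features). The class name and frame size are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DeepfakeClassifier(nn.Module):
    """Per-frame CNN features -> LSTM over time -> 2-way classification.
    The CNN is stubbed with a linear projection here; the real model
    uses a ResNeXt-50 backbone producing 2048-d per-frame features."""
    def __init__(self, feat_dim=2048, hidden=2048, num_classes=2):
        super().__init__()
        self.cnn = nn.Linear(3 * 112 * 112, feat_dim)  # stand-in backbone
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.dropout = nn.Dropout(0.4)                 # 0.4 dropout chance
        self.relu = nn.ReLU()
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):                  # x: (batch, frames, 3, 112, 112)
        b, t = x.shape[:2]
        feats = self.cnn(x.view(b, t, -1))         # per-frame features
        out, _ = self.lstm(feats)                  # temporal modelling
        out = self.relu(self.dropout(out[:, -1]))  # last time step
        return self.fc(out)                        # logits: fake vs pristine
```

The LSTM lets the classifier use temporal inconsistencies across frames, which single-image detectors cannot see.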
What is a deepfake voice? A voice deepfake is a synthetic voice that closely mimics a real person's voice, accurately replicating the tonality, accent, cadence, and other unique characteristics of the target person. People use AI and powerful computing hardware to generate such voice clones or synthetic voices.
How to detect a fake voice: Early audio deepfake detection relied mainly on hidden Markov models and Gaussian mixture models, and later evolved into front-end/back-end architectures. A typical audio deepfake detection system is a framework composed of a front end and a back end: the front end extracts acoustic features from speech, and the back end converts those features into scores. Traditional front-end feature extractors use digital signal processing algorithms to extract spectrum, phase, or other acoustic features.
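The front-end/back-end split above can be sketched with a classic DSP front end (log-magnitude spectrum per frame) and a toy scoring back end. The function names, frame parameters, and the linear scorer are illustrative assumptions; a real back end would be a trained classifier such as a GMM or neural network.

```python
import numpy as np

def log_spectrum_frontend(wave, n_fft=512, hop=256):
    """DSP front end: frame the waveform, apply a Hann window, and take
    the log-magnitude spectrum of each frame (one row per frame)."""
    n_frames = 1 + (len(wave) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))  # magnitude spectrum per frame
    return np.log(mag + 1e-8)                  # log compression

def backend_score(features, weights):
    """Toy back end: collapse the feature matrix into a single scalar
    score by averaging over frames and projecting onto a weight vector."""
    return float(features.mean(axis=0) @ weights)
```

This separation is what lets the two halves evolve independently: the front end can be swapped for learned features while the back end stays a generic scorer.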
The whole model consists of a pre-trained HuBERT-based front end and a back-end detection model. The input of the entire model is the original waveform, and the output is a binary classification result. First, the data were pre-processed by adding impulse-signal and white-noise additive noise to the original audio for data augmentation (see Section 3.1 for details). Next, a self-supervised pre-trained model with fine-tuning (see Section 3.2 for more information) was used to extract acoustic features. A fully connected layer, trained jointly with the back-end detection model, was added after the self-supervised front end to reduce the dimensionality of the self-supervised model's output. The extracted acoustic features were then processed by the three residual blocks of the back-end detection model (see Section 3.3 for details), where α-FMS was used to obtain more discriminative features. Finally, a softmax activation function in the output layer produced the real-or-fake detection result.
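The back-end stages described above can be sketched in PyTorch: a fully connected layer reduces the front-end feature dimension, three residual blocks process the sequence, and softmax produces real/fake probabilities. The residual blocks here are plain ones (α-FMS omitted for brevity), the class names are assumptions, and the default `in_dim=768` assumes a HuBERT-base hidden size.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Plain 1-D residual block (the described model adds alpha-FMS)."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(ch, ch, 3, padding=1), nn.BatchNorm1d(ch),
            nn.LeakyReLU(), nn.Conv1d(ch, ch, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

class Backend(nn.Module):
    """Front-end features -> FC dimensionality reduction -> three
    residual blocks -> time pooling -> softmax over {real, fake}."""
    def __init__(self, in_dim=768, ch=64):
        super().__init__()
        self.reduce = nn.Linear(in_dim, ch)       # FC after the front end
        self.blocks = nn.Sequential(*[ResBlock(ch) for _ in range(3)])
        self.out = nn.Linear(ch, 2)
    def forward(self, feats):                     # feats: (batch, time, in_dim)
        x = self.reduce(feats).transpose(1, 2)    # -> (batch, ch, time)
        x = self.blocks(x).mean(dim=2)            # pool over time
        return torch.softmax(self.out(x), dim=1)  # real/fake probabilities
```

The joint FC layer keeps the self-supervised features compact enough for the convolutional back end without retraining the whole front end from scratch.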
Fine-tuning is a transfer learning method suited to smaller datasets; it has low training costs and can improve detection performance on known attacks. Some studies have shown that fine-tuning is beneficial and can prevent overfitting, promoting better generalization [14]. Pre-training only extracts features of natural speech; fine-tuning with both natural and deepfake audio data enables the self-supervised pre-trained model to adapt to the downstream task of audio deepfake detection, which helps to improve detection performance. The process of fine-tuning is shown in Figure 2b. After pre-training on unlabeled data, fine-tuning was performed on the two labeled training sets. The back-end detection model and the pre-trained HuBERT model were jointly optimized by back-propagation, and the weighted cross-entropy loss function was used to calculate the loss.
The original RawNet2 model cannot fully extract the deeper features of fake audio, cannot effectively distinguish the key features of real and deepfake speech, and its generalizability needs improvement. Therefore, this study made the following improvements to the RawNet2 model: (1) a self-supervised speech pre-training model was used instead of sinc convolutional layers; (2) the residual structure was improved by using α-FMS instead of FMS; (3) the number of residual blocks was reduced. Most end-to-end speaker recognition models show degraded performance compared to models using manual features, while the widely adopted ECAPA-TDNN model and its variants [22,23] achieve an EER below 1%. In this study, we followed the setting of the ECAPA-TDNN model and reduced the number of residual blocks from 6 to 3 to speed up training and make the model more efficient. The structure of the improved model is shown in Figure 3a, and the structure of the improved residual block is shown in Figure 3b.
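The α-FMS component named above can be sketched as follows, based on our reading of the RawNet2 family of models: a per-channel learnable α is added to the feature map before filter-wise sigmoid gates (computed from global average pooling and a fully connected layer) rescale each channel. Treat this exact formulation as an assumption rather than the paper's definitive implementation.

```python
import torch
import torch.nn as nn

class AlphaFMS(nn.Module):
    """alpha-FMS sketch: add a per-channel learnable alpha to the feature
    map, then rescale each channel by a sigmoid gate derived from global
    average pooling followed by a fully connected layer."""
    def __init__(self, ch):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, ch, 1))  # learnable shift
        self.fc = nn.Linear(ch, ch)
    def forward(self, x):                          # x: (batch, ch, time)
        s = torch.sigmoid(self.fc(x.mean(dim=2)))  # filter-wise scales
        return (x + self.alpha) * s.unsqueeze(-1)  # scale shifted maps
```

Replacing the fixed additive term of FMS with a learnable α lets each filter decide how much of the unscaled signal to retain, which is what is meant by obtaining more discriminative features.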