PRESENTATION ON DEEPFAKES, B.E. (ISE), VI Semester, Section B, 2020
OUTLINE
- DEEPFAKES
- NEURAL NETWORKS
- ANALYZING THE TECHNOLOGY
- PROCESS
- EXAMPLES
- APPLICATIONS
- 3D HEAD POSE ESTIMATION
- INCONSISTENT HEAD POSES IN DEEP FAKES
- CLASSIFICATION BASED ON HEAD POSES
- CONCLUSION
Do you believe everything you see? Well, what you see is not always what it seems!
SAY HELLO TO DEEPFAKES!
DEEPFAKES The word deepfake has been around for only a few years. It is a combination of “deep learning” – a subset of AI that uses neural networks – and “fake.” It is a technique for human image synthesis used to combine and superimpose existing images and videos onto source images or videos, using a machine learning technique known as a generative adversarial network. The term was coined in 2017, named after a Reddit user known as “deepfakes” who, in December 2017, used the technology to edit the faces of celebrities onto people in pornographic video clips. These videos and audio clips look and sound just like the real thing. Deepfakes are lies disguised to look like truth.
NEURAL NETWORKS A deep neural network is a deep learning model: what artificial intelligence researchers call a computer system trained to perform a specific task, in this case recognizing altered images. These networks are organized in connected layers, and deep neural network architectures can identify manipulated images at the pixel level with high precision. Such neural networks also power the filters in Snapchat and Instagram.
ANALYZING THE TECHNOLOGY The following criteria define what constitutes a successful deepfake and are used to evaluate these requirements:
- Number of images
- Lighting conditions
- Size/quality of the source material
- Angle of the source material
- Differing facial structures
- Overlapping objects
PROCESS At the moment there are two main applications used to create deepfakes: FakeApp and faceswap. Both require three steps: extraction, training and creation. Extraction The “deep” in deepfakes comes from the fact that this face-swap technology uses Deep Learning, which typically requires large amounts of data. Without hundreds of face pictures or some videos, you will not be able to create a deepfake video. A way around this is to collect a number of video clips featuring the people you want to face-swap. Extraction is the process of extracting all frames from these clips, identifying the faces and aligning them. The alignment is critical, since the neural network that performs the face swap requires all faces to have the same size (usually 256×256 pixels) and aligned features. Detecting and aligning faces is a problem that is considered mostly solved, and most applications handle it very efficiently; a sketch of this step follows below.
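A minimal sketch of the extraction step, assuming dlib and OpenCV are installed and the standard dlib 68-landmark model file is available. The 256×256 chip size follows the text above; file names and other details are illustrative assumptions, not the exact pipeline of FakeApp or faceswap.

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_aligned_faces(video_path, out_size=256):
    """Read every frame of a clip, detect faces, and return aligned face chips."""
    chips = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        for face in detector(rgb):
            landmarks = predictor(rgb, face)
            # get_face_chip rotates and scales the crop so the facial
            # features line up -- the alignment the face-swap network needs.
            chips.append(dlib.get_face_chip(rgb, landmarks, size=out_size))
    cap.release()
    return chips
```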
Training Training is a technical term borrowed from Machine Learning. Here it refers to the process that teaches a neural network to convert one face into another. Although it takes several hours, the training phase needs to be done only once; once completed, the network can convert a face from person A into person B. This is the most obscure part of the entire process; a sketch of the idea follows below.
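A minimal PyTorch sketch of the training idea commonly used in face-swap tools: one shared encoder with two decoders, one per identity, each trained to reconstruct its own person. Layer sizes, the stand-in data, and hyperparameters are illustrative assumptions; real tools use much deeper convolutional networks.

```python
import torch
import torch.nn as nn

class FaceSwapAE(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared encoder: learns identity-independent face structure.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
        )
        # One decoder per person: learns to render that person's face.
        self.decoder_a = self._decoder()
        self.decoder_b = self._decoder()

    @staticmethod
    def _decoder():
        return nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x, which):
        z = self.encoder(x)
        return self.decoder_a(z) if which == "a" else self.decoder_b(z)

model = FaceSwapAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()

# Stand-in data: batches of 64x64 aligned face chips for persons A and B.
loader = [(torch.rand(8, 3, 64, 64), torch.rand(8, 3, 64, 64))]

# Each identity reconstructs itself during training; at creation time,
# a face of A pushed through decoder_b produces the swap.
for faces_a, faces_b in loader:
    loss = loss_fn(model(faces_a, "a"), faces_a) \
         + loss_fn(model(faces_b, "b"), faces_b)
    opt.zero_grad()
    loss.backward()
    opt.step()
```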
Creation Once training is complete, it is finally time to create a deepfake. Starting from a video, all frames are extracted and all faces aligned. Then each face is converted using the trained neural network, and the final step is to merge the converted face back into the original frame. While this sounds like an easy task, it is actually where most face-swap applications go wrong. Creation is the only step that does not use any Machine Learning, and it is the phase where most mistakes appear. Moreover, each frame is processed independently, with no temporal correlation between frames, so the final video may flicker. A blending sketch for the merge step follows below.
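A minimal sketch of the merge step using OpenCV's Poisson blending (cv2.seamlessClone). The arguments `frame`, `swapped_chip`, and the mask/placement choices are illustrative assumptions; real tools also soften masks around the landmarks and color-match before blending.

```python
import cv2
import numpy as np

def merge_face(frame, swapped_chip, top_left):
    """Blend a converted face chip back into the original video frame."""
    h, w = swapped_chip.shape[:2]
    x, y = top_left
    # White mask over the whole chip; production code would use a soft
    # mask traced around the facial landmarks instead.
    mask = 255 * np.ones((h, w), dtype=np.uint8)
    center = (x + w // 2, y + h // 2)
    return cv2.seamlessClone(swapped_chip, frame, mask, center,
                             cv2.NORMAL_CLONE)
```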
EXAMPLES Obama Deepfake by Jordan Peele In this deepfake, which seems deceptively real, the American actor and director Jordan Peele shows former US President Barack Obama speaking about the dangers of false information and fake news. Peele transferred his own facial movements onto Obama’s facial characteristics using deepfake technology. Mark Zuckerberg Deepfake This deepfake manipulates the audio to make Facebook CEO Mark Zuckerberg sound like a psychopath talking to CBS News about the "truth of Facebook and who really owns the future." The video was widely circulated on Instagram and ultimately went viral.
APPLICATIONS FAKEAPP In January 2018, a proprietary desktop application called FakeApp was launched. The app allows users to easily create and share videos with faces swapped. It uses an artificial neural network, a GPU, and three to four gigabytes of storage space to generate the fake video. To produce detailed results, the program needs a lot of visual material of the person to be inserted, in order to learn which image aspects have to be exchanged, based on the video sequences and images. FACESWAP When applied correctly, this technique is uncannily good at swapping faces, but it has a major disadvantage: it only works on pre-existing pictures. It relies on neural networks, computational models loosely inspired by the way real brains process information. This technique generates so-called deepfakes, which morph a person’s face to mimic someone else’s features while preserving the original facial expression.
3D HEAD POSE ESTIMATION The 3D head pose corresponds to the rotation and translation from world coordinates to the corresponding camera coordinates. Specifically, denote $[U, V, W]^T$ as the world coordinates of one facial landmark, $[X, Y, Z]^T$ as its camera coordinates, and $(x, y)^T$ as its image coordinates. The transformation between the world and camera coordinate systems can be formulated as

$$[X, Y, Z]^T = R\,[U, V, W]^T + \vec{t},$$

where $R$ is the $3 \times 3$ rotation matrix and $\vec{t}$ is the $3 \times 1$ translation vector. The transformation between the camera and image coordinate systems is defined as

$$s\,[x, y, 1]^T = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} [X, Y, Z]^T,$$

where $f_x$ and $f_y$ are the focal lengths in the $x$- and $y$-directions, $(c_x, c_y)$ is the optical center, and $s$ is an unknown scaling factor.
In 3D head pose estimation, we need to solve the inverse problem, i.e., estimating $s$, $R$ and $\vec{t}$ from the 2D image coordinates and the 3D world coordinates of the same set of facial landmarks obtained from a standard model, e.g., a 3D average face model, assuming the camera parameters are known. Specifically, for a set of $n$ facial landmark points, this can be formulated as the optimization problem

$$\min_{s, R, \vec{t}} \; \sum_{i=1}^{n} \left\| s^{-1} K \left( R\,[U_i, V_i, W_i]^T + \vec{t}\, \right) - [x_i, y_i, 1]^T \right\|^2,$$

where $K$ is the camera matrix above; it can be solved efficiently using the Levenberg-Marquardt algorithm [15]. The estimated $R$ is the camera pose, i.e., the rotation of the camera with respect to the world coordinates, and the head pose is obtained by inverting it as $R^T$ (since $R$ is an orthonormal matrix).
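A minimal sketch of this estimation using OpenCV's solvePnP, which solves the same least-squares problem with Levenberg-Marquardt. The inputs `model_points_3d` (from a 3D average face model) and `image_points_2d` (from a landmark detector) are assumed to be given; the intrinsics follow the approximation stated later in the classification procedure.

```python
import cv2
import numpy as np

def estimate_head_pose(model_points_3d, image_points_2d, image_size):
    """Estimate head pose from matched 3D model and 2D image landmarks."""
    h, w = image_size
    # Approximate intrinsics: focal length = image width,
    # optical center = image center, no lens distortion.
    K = np.array([[w, 0, w / 2.0],
                  [0, w, h / 2.0],
                  [0, 0, 1.0]])
    dist = np.zeros(4)
    ok, rvec, tvec = cv2.solvePnP(model_points_3d, image_points_2d,
                                  K, dist, flags=cv2.SOLVEPNP_ITERATIVE)
    R, _ = cv2.Rodrigues(rvec)  # camera rotation w.r.t. the world
    head_pose = R.T             # head pose is the inverse (transpose)
    return head_pose, tvec
```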
INCONSISTENT HEAD POSES IN DEEP FAKES Because the Deep Fake process in Fig. 1 swaps only the central face region, the landmark locations of fake faces often deviate from those of the original faces. As shown in Fig. 1(c), a landmark $P_0$ in the central face region is first affine-transformed into $P_0^{in} = M P_0$. After passing through the generative neural network, its corresponding landmark on the faked face is $Q_0^{out}$. Since the configuration of the generative neural network in Deep Fake does not guarantee landmark matching, and people have different facial structures, the landmark $Q_0^{out}$ on the generated face can land at a different location from $P_0^{in}$. Comparing the 51 central-region landmarks of 795 pairs of 64×64-pixel images, the mean shift of a landmark from the input (Fig. 1(d)) to the output (Fig. 1(e)) of the generative neural network is 1.540 pixels, with a standard deviation of 0.921 pixels. After the inverse transformation $Q_0 = M^{-1} Q_0^{out}$, the landmark locations $Q_0$ in the faked face will differ from the corresponding landmarks $P_0$ in the original face; a small numeric sketch of this follows below.
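A small NumPy sketch of the mismatch argument above: a central landmark is affine-transformed into the network's input space, the network shifts it slightly, and the inverse transform carries the shift back into the original image. The warp matrix and shift values are illustrative assumptions, chosen near the ~1.5-pixel mean shift reported above.

```python
import numpy as np

M = np.array([[1.1, 0.0, 4.0],      # hypothetical 2x3 affine warp
              [0.0, 1.1, 6.0]])
A = np.vstack([M, [0, 0, 1]])       # 3x3 homogeneous form

P0 = np.array([120.0, 140.0, 1.0])  # landmark in the original image
P0_in = A @ P0                      # landmark in the network's input
Q0_out = P0_in + [1.5, 0.9, 0.0]    # network shifts it by ~1.5 px
Q0 = np.linalg.inv(A) @ Q0_out      # mapped back into the original image

# Nonzero residual: the faked face's landmark deviates from the original.
print(np.linalg.norm(Q0[:2] - P0[:2]))
```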
Fig. 1. Distribution of the cosine distance between $\vec{v}_c$ and $\vec{v}_a$ for fake and real face images.
CLASSIFICATION BASED ON HEAD POSES We further trained SVM classifiers on the differences between head poses estimated from the full set of facial landmarks and those estimated from the central face region, to differentiate Deep Fakes from real images or videos. The features are extracted with the following procedure (a sketch follows below):
(1) For each image or video frame, we run a face detector and extract 68 facial landmarks using the software package DLib [16].
(2) Then, with the standard 3D facial landmark model of the same 68 points from OpenFace2 [17], the head pose of the central face region ($R_c$ and $\vec{t}_c$) is estimated from landmarks 18-36, 49, 55 (red in Fig. 2), and the whole-face pose ($R_a$ and $\vec{t}_a$) from landmarks 1-36, 49, 55 (red and blue in Fig. 2). Here, we approximate the camera focal length by the image width and the camera center by the image center, and ignore lens distortion.
(3) The differences between the obtained rotation matrices ($R_a - R_c$) and translation vectors ($\vec{t}_a - \vec{t}_c$) are flattened into a vector, which is standardized by subtracting its mean and dividing by its standard deviation before classification.
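A minimal scikit-learn sketch of this classification step, assuming $(R_a, \vec{t}_a)$ and $(R_c, \vec{t}_c)$ have been estimated per image as described above. The feature construction follows the text; the SVM kernel choice and pipeline layout are illustrative assumptions.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def pose_feature(R_a, t_a, R_c, t_c):
    """Flatten rotation and translation differences into one feature vector."""
    return np.concatenate([(R_a - R_c).ravel(), (t_a - t_c).ravel()])

def train_classifier(features, labels):
    """features: (n_samples, 12) array; labels: 1 = Deep Fake, 0 = real."""
    # StandardScaler performs the mean/std standardization from step (3).
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    clf.fit(features, labels)
    return clf
```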
Fig. 2. ROC curves of the SVM classification results (see text for details), from experimental evaluations of our method on a set of real face images and Deep Fakes.
CONCLUSION In this paper, we propose a new method to expose AI-generated fake face images or videos (commonly known as Deep Fakes). Our method is based on the observation that such Deep Fakes are created by splicing a synthesized face region into the original image, and in doing so introduce errors that can be revealed when 3D head poses are estimated from the face images. We perform experiments to demonstrate this phenomenon and further develop a classification method based on this cue.