“Stream loss”: ConvNet learning for face verification using unlabeled videos in the wild

Elaheh Rashedi, Nov 27, 2018

About This Presentation

Face recognition tasks have seen significantly improved performance due to ConvNets. However, less attention has been given to face verification from videos. This paper makes two contributions along these lines. First, we propose a method, called stream loss, for learning ConvNets using unlabeled ...


Slide Content

Learning Convolutional Neural Network for Face Verification. Presented by: Elaheh Rashedi, PhD in Computer Science, Wayne State University, 2018. Advisor: Professor Xue-wen Chen

Contents: Introduction; Long-Term Face Tracking using ConvNet; “FaceSequence”: Video dataset for Face Recognition; “Stream-Loss”: ConvNet Learning for Face Verification; Conclusion & Future Work

Introduction. Background; Convolutional Neural Network (ConvNet); ConvNet-based Face Verification; ConvNet-based Face Recognition Models (two-step verification model, single-step verification model); Train and Test Datasets; Challenges; Our Contributions

Convolutional Neural Network. A kind of neural network whose input is an image; it has far less full connectivity between neurons than an ordinary fully connected network. Fig 1. General ConvNet structure in face recognition problems
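
A minimal sketch of the kind of ConvNet structure Fig 1 refers to, written in PyTorch. The layer sizes, the 112x112 input, and the 128-dimensional embedding are illustrative placeholders, not the architecture used in this work.

```python
# Minimal illustrative ConvNet for face embeddings (PyTorch).
# Layer sizes and the embedding dimension are placeholders.
import torch
import torch.nn as nn

class FaceConvNet(nn.Module):
    def __init__(self, embedding_dim=128):
        super().__init__()
        # Convolution + pooling layers share weights spatially,
        # so there is far less full connectivity than in an MLP.
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # A single fully connected layer maps pooled features to an embedding.
        self.embed = nn.Linear(128, embedding_dim)

    def forward(self, x):
        x = self.features(x).flatten(1)
        return self.embed(x)

# Example: embed a batch of four 112x112 RGB face crops.
net = FaceConvNet()
faces = torch.randn(4, 3, 112, 112)
embeddings = net(faces)  # shape (4, 128)
```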

ConvNet-based Face Verification. Common steps: face detection (Viola-Jones, Cascade CNN, …); pre-processing (geometric & lighting normalization); ConvNet training (supervised vs. unsupervised); face identification (classification task); metric learning (Joint-Bayesian, cosine similarity, triplet similarity, energy-based similarity, …); face verification.
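
The final verification step reduces to comparing two embeddings under a similarity metric and thresholding. A minimal sketch, assuming cosine similarity on embeddings from some trained network; the 0.75 default mirrors the similarity threshold reported later in the DVT results and is otherwise just an example value.

```python
# Cosine-similarity verification between two face embeddings.
# The embeddings and the threshold value are illustrative.
import torch
import torch.nn.functional as F

def verify(emb_a: torch.Tensor, emb_b: torch.Tensor, threshold: float = 0.75) -> bool:
    """Return True if the two embeddings are judged to belong to the same identity."""
    sim = F.cosine_similarity(emb_a.unsqueeze(0), emb_b.unsqueeze(0)).item()
    return sim >= threshold

emb_a = F.normalize(torch.randn(128), dim=0)
emb_b = F.normalize(torch.randn(128), dim=0)
print(verify(emb_a, emb_b))
```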

ConvNet-based Face Verification Models. Two-step verification models (two separate frameworks for identification and verification): the DeepFace model, the web-scaled DeepFace model, and the DeepID model series. Single-step verification models (the same framework for identification and verification): the FaceNet model (GoogLeNet) and the VGG model.

Train and Test Datasets. Table 1. The common face recognition datasets

Challenges. Few video-based trainable Convolutional Neural Network models have been proposed; there is a lack of publicly available video training datasets; existing long-term face tracking algorithms have low accuracy, yet face tracking algorithms could be utilized to collect a training video dataset.

Our Contributions. Designing a video-based face verification model using ConvNets.

Contents: Introduction; Long-Term Face Tracking using ConvNet; “FaceSequence”: Video dataset for Face Recognition; “Stream-Loss”: ConvNet Learning for Face Verification; Conclusion & Future Work

Long-Term Face Tracking using ConvNet. Common Long-Term Tracking Algorithms; Tracking Challenges; Proposed Model: the Detection-Verification-Tracking model (DVT), with Deep-Learning-based Face Detection, ConvNet-based Face Verification, and Multi-Patch-based Face Tracking; DVT System Framework; Demonstration; Results

Common Long-Term Tracking Algorithms. Common tracking steps: select a video; place a bounding box around the target; distinguish the object from the background; track the object around the same region in the next frame. Fig 2. Tracking schema using a bounding box

Tracking Challenges. Tracking can be challenging on real-world noisy videos: it is not robust against appearance changes, occlusion, fast motion, illumination changes, or background clutter; it is sensitive to the initialization of the target; and it cannot handle all situations. The long-term tracking challenge: tracking is not reliable in cases where the object leaves the view.

Detection-Verification-Tracking Model (DVT). Goal: long-term face tracking of targets in wild videos (unconstrained environments). The model includes three components: deep-learning-based face detection, ConvNet-based face verification, and multi-patch-based tracking.

Deep-Learning-based Face Detection. Model: Cascade-CNN (a ConvNet-based detection model). ConvNet structure: 3 ConvNets for face vs. non-face (binary classification) and 3 ConvNets for bounding-box calibration (multiclass classification). Fig 3. Cascade-CNN face detection for binary classification
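
A schematic sketch of the cascade idea only: each stage is a small face/non-face classifier that rejects candidate windows early, so only surviving windows reach the larger, costlier stages. The stage scorers and thresholds below are dummy placeholders, not the Cascade-CNN networks themselves.

```python
# Schematic cascade filtering of candidate windows; stage models and
# thresholds are placeholders standing in for the three Cascade-CNN stages.
from typing import Callable, List, Tuple

Window = Tuple[int, int, int, int]  # (x, y, w, h)

def cascade_detect(windows: List[Window],
                   stages: List[Callable[[Window], float]],
                   thresholds: List[float]) -> List[Window]:
    survivors = windows
    for stage, thr in zip(stages, thresholds):
        # Keep only windows that the current stage scores above its threshold.
        survivors = [w for w in survivors if stage(w) >= thr]
        if not survivors:
            break
    return survivors

# Toy usage: three dummy stages with fixed scores.
stages = [lambda w: 0.9, lambda w: 0.8, lambda w: 0.7]
print(cascade_detect([(0, 0, 24, 24)], stages, [0.5, 0.6, 0.65]))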

ConvNet-based Face Verification. Pre-trained network based on VGG (MatConvNet). Convolutional Neural Network: 37 layers; feature vector dimension: 4098. Fig 4. Proposed verification steps
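
A PyTorch/torchvision analogue of using a VGG-style backbone as a fixed feature extractor; the actual work used MatConvNet with face-trained weights, and its layer count and descriptor size differ from the stock VGG-16 shown here, so treat this purely as a sketch of the pattern.

```python
# Using a VGG-style backbone as a fixed feature extractor (illustrative analogue
# of the MatConvNet setup described above; face-trained weights are not shown).
import torch
from torchvision.models import vgg16

backbone = vgg16()  # in practice, load face-verification weights here
# Keep everything up to the penultimate fully connected layer as the descriptor.
backbone.classifier = torch.nn.Sequential(*list(backbone.classifier.children())[:-1])
backbone.eval()

with torch.no_grad():
    face = torch.randn(1, 3, 224, 224)   # a preprocessed face crop
    descriptor = backbone(face)          # 4096-dim feature vector for stock VGG-16
print(descriptor.shape)
```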

Multi-Patch-based Face Tracking. Employs multiple patches around the target; categorizes patches as reliable or non-reliable; tracks the reliable patches and ignores the non-reliable ones; the result is the average over the reliable patches. Fig 5. Multi-patch tracking
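
A sketch of the patch-averaging idea: each patch votes with its own motion estimate, unreliable patches are dropped, and the remaining estimates are averaged. The reliability test here (a confidence-score threshold) is a placeholder for whatever criterion the tracker actually uses.

```python
# Average the motion estimates of reliable patches; discard the rest.
# The confidence threshold is a placeholder reliability criterion.
import numpy as np

def fuse_patch_estimates(displacements: np.ndarray,
                         confidences: np.ndarray,
                         min_conf: float = 0.5) -> np.ndarray:
    """displacements: (N, 2) per-patch (dx, dy); confidences: (N,) scores."""
    reliable = confidences >= min_conf
    if not reliable.any():
        return np.zeros(2)  # no reliable patch: keep the previous position
    return displacements[reliable].mean(axis=0)

dxdy = np.array([[2.0, 1.0], [2.2, 0.9], [15.0, -8.0]])  # third patch is an outlier
conf = np.array([0.9, 0.8, 0.2])
print(fuse_patch_estimates(dxdy, conf))  # ~[2.1, 0.95]
```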

DVT System Framework Fig 6. Flowchart of the proposed Long-term face tracking method, DVT

Demonstration Fig 7. Demonstration of the DVT system for pausing the video and selecting the target face to be tracked

Fig 13. Demonstration of the DVT tracking results

Fig 8. An example of DVT output sequence.

Results. Implemented in Matlab R2015b with MatConvNet; the GUI is implemented in Java. Thresholds: similarity threshold 0.75, skip time 3 s. Running time: roughly 2x the video duration. Table 2. Comparison between TLD, Face-TLD, and the proposed DVT method in terms of precision and recall on the sitcom IT-Crowd (first series, first episode).

Contents: Introduction; Long-Term Face Tracking using ConvNet; “FaceSequence”: Video dataset for Face Recognition; “Stream-Loss”: ConvNet Learning for Face Verification; Conclusion & Future Work

“FaceSequence”: Video dataset for Face Recognition. Stream Collection & Labeling: a highly automated strategy, based on long-term face tracking, using noisy videos collected from the web. FaceSequence Statistics; Stream Samples; FaceSequence Advantages

Stream Collection & Labeling. Steps: video collection (face videos are collected from the web and curated to control biases in ethnicity, gender, and age); original target selection (x_o), employing a face detection algorithm; negative sample selection (x_n), detecting other faces in the same frame as x_o; positive sample stream selection (x_p), deploying the face tracking algorithm and tracking x_o for a specific time period.
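
A sketch of this collection loop. The `detect_faces` and `track_face` helpers are hypothetical stand-ins for the face detector and the DVT tracker, and picking the first detection as the target is a simplification of however the target is actually selected.

```python
# Sketch of the stream-collection loop; detect_faces and track_face are
# hypothetical stand-ins for the detector and the long-term tracker.
from typing import Dict, List

def detect_faces(frame) -> List[dict]:
    raise NotImplementedError("stand-in for the face detector")

def track_face(video, start_idx, target, duration=1.0) -> List[dict]:
    raise NotImplementedError("stand-in for the long-term face tracker")

def collect_streams(video) -> List[Dict]:
    streams = []
    for frame_idx, frame in enumerate(video):
        faces = detect_faces(frame)
        if not faces:
            continue
        target, *negatives = faces                       # x_o and same-frame negatives x_n
        positives = track_face(video, frame_idx, target)  # ~1 s stream of the target: x_p
        streams.append({"target": target,
                        "positives": positives,
                        "negatives": negatives})
    return streams
```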

FaceSequence Statistics. Table 3. Characteristics of the FaceSequence dataset, including the total number of collected videos, the number of streams extracted from the videos, and the number of frames per stream.

Stream Samples. Fig 9. A sample of the frame streams available in the FaceSequence dataset for 5 identities.

FaceSequence Advantages. It contains streams of frames extracted from noisy videos. It retains higher similarity per subject: each stream consists of stills from 1 second of video, so images in a stream are more similar in terms of background, lighting, and resolution. It is widely scalable: in public face datasets, labeled celebrity photos are crawled from the web, making it challenging to assemble millions of individuals; in private face datasets, human annotators are involved to expand the data, which is costly and time consuming; in FaceSequence, streams are labeled automatically, with no human interaction in the labeling loop, so the dataset is expandable.

Contents: Introduction; Long-Term Face Tracking using ConvNet; “FaceSequence”: Video dataset for Face Recognition; “Stream-Loss”: ConvNet Learning for Face Verification; Conclusion & Future Work

“Stream-Loss”: ConvNet Learning for Face Verification. ConvNet-based Face Recognition Methods; Loss Learning Approaches; Video-based ConvNet Models; Stream-based ConvNet Learning Method (Proposed Architecture Design, Stream Loss Learning); Experimental Results (LFW and YTF Datasets, IJB-A Video Dataset)

Loss Learning Approaches. Contrastive loss: based on the distance between two objects. Triplet loss: based on the distance between three objects. Multiple loss: based on the distance between multiple objects. Fig 10. Triplet loss. Fig 11. Multiple loss
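
A minimal PyTorch sketch of the standard triplet loss shown in Fig 10: pull the anchor toward the positive and push it away from the negative by at least a margin. The margin value is illustrative.

```python
# Standard triplet loss over batches of anchor/positive/negative embeddings.
# The margin value is illustrative.
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_ap = F.pairwise_distance(anchor, positive)  # anchor-positive distance
    d_an = F.pairwise_distance(anchor, negative)  # anchor-negative distance
    return F.relu(d_ap - d_an + margin).mean()

a, p, n = (F.normalize(torch.randn(8, 128), dim=1) for _ in range(3))
print(triplet_loss(a, p, n).item())
```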

Video-based ConvNet Models. These adapt an image-based ConvNet to videos: each face image is mapped into a single feature vector using a ConvNet, and all feature vectors are then aggregated into a single one using an aggregation function (average, max, …). Examples: Neural Aggregation Network (NAN), input-aggregated network. Pros: simple architecture. Cons: each frame is treated like a still image, so the temporal relation between frames is ignored. Fig 12. NAN architecture for video face recognition.
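
A sketch of the frame-aggregation idea these models share: per-frame embeddings are reduced to one template vector. Plain average pooling is shown; NAN replaces the fixed average with learned attention weights, which the optional `weights` argument only gestures at.

```python
# Aggregate per-frame embeddings into a single video-level template vector.
# Uniform averaging shown; NAN learns attention weights instead.
import torch

def aggregate(frame_embeddings: torch.Tensor, weights: torch.Tensor = None) -> torch.Tensor:
    """frame_embeddings: (T, D); weights: optional (T,) attention scores."""
    if weights is None:
        return frame_embeddings.mean(dim=0)  # simple average pooling
    weights = torch.softmax(weights, dim=0)
    return (weights.unsqueeze(1) * frame_embeddings).sum(dim=0)

frames = torch.randn(30, 128)   # embeddings of 30 frames of a face video
template = aggregate(frames)    # (128,) video-level descriptor
print(template.shape)
```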

Stream-based ConvNet Learning Method. We propose a novel video-based ConvNet architecture, designed for face verification, whose training inputs are streams taken from videos (rather than still images), together with a video-based loss learning approach named stream-loss.

Proposed Architecture Design. Fig 13. The architecture of the stream-based ConvNet for video face recognition.

Proposed Architecture Design (cont.). Fig 14. Proposed flowchart for face verification.

Stream Loss Learning. The set of input images; the set of output feature vectors; the L2-norm distance between pairs of y_o, y_p, y_n.
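
Written out, these quantities can be expressed as below, under the assumption that y_o, y_p^i, and y_n denote the embeddings of the target image, the positive-stream frames, and the negative sample; the exact thesis formulation may differ.

```latex
% Embeddings of the target image, the positive-stream frames, and the negative sample,
% all produced by the same ConvNet f(\cdot):
y_o = f(x_o), \qquad y_p^{i} = f(x_p^{i}), \qquad y_n = f(x_n)
% L2-norm distance between a pair of output feature vectors:
d(y_a, y_b) = \lVert y_a - y_b \rVert_2
```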

Stream Loss Learning (cont.). The smooth-max and smooth-min functions.
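
One common realization of smooth maximum and minimum is the log-sum-exp pair below; whether the stream loss uses exactly this form is an assumption, and it is given here only as a concrete reference.

```latex
% Log-sum-exp smooth approximations of max and min over stream distances d_1, \dots, d_T
% (\alpha > 0 controls sharpness; \alpha \to \infty recovers the hard max/min):
\operatorname{smoothmax}(d_1,\dots,d_T) = \tfrac{1}{\alpha}\log\sum_{i=1}^{T} e^{\alpha d_i},
\qquad
\operatorname{smoothmin}(d_1,\dots,d_T) = -\tfrac{1}{\alpha}\log\sum_{i=1}^{T} e^{-\alpha d_i}
```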

Stream Loss Learning (cont.). The stream-loss function.
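
A margin-based sketch consistent with the triplet loss and the smooth-max operator above, to be read as illustrative rather than the verbatim stream-loss equation: the hardest positive distance within the stream (smooth-max) must undercut the negative distance by a margin m.

```latex
% Illustrative stream-loss form over a target embedding y_o, a positive stream \{y_p^i\},
% and a negative embedding y_n, with margin m > 0:
L_{\mathrm{stream}} = \max\!\Big(0,\;
  \operatorname{smoothmax}_{i}\, d(y_o, y_p^{i}) \;-\; d(y_o, y_n) \;+\; m \Big)
```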

Experimental Results. The network is trained on the FaceSequence dataset. Test on image datasets: the LFW dataset includes 13,233 images of 5,749 different identities; the YTF dataset includes 3,425 videos of 1,595 different identities. Test on a video dataset: the IJB-A dataset includes 5,397 images and 2,042 videos of 500 identities.

Experiments on LFW and YTF Datasets Table 4. Comparison of Verification Performance of Different Methods on the LFW and YTF Datasets.

Experiments on IJB-A Video Dataset. Table 5. Performance comparison on the IJB-A dataset. TAR/FAR: True/False Acceptance Rate for verification. The TAR of our method at FAR = 0.01 reduces the error of VGG by 67%, which demonstrates a significant improvement.
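
For reference, the relative error reduction quoted in the caption follows from comparing the miss rates (1 - TAR) of the two methods at the same FAR operating point:

```latex
% Relative error reduction at a fixed FAR, comparing miss rates (1 - TAR):
\text{error reduction} =
\frac{(1 - \mathrm{TAR}_{\mathrm{VGG}}) - (1 - \mathrm{TAR}_{\mathrm{ours}})}{1 - \mathrm{TAR}_{\mathrm{VGG}}}
```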

Contents: Introduction; Long-Term Face Tracking using ConvNet; “FaceSequence”: Video dataset for Face Recognition; “Stream-Loss”: ConvNet Learning for Face Verification; Conclusion & Future Work

Conclusion

Future Work. Feeding streams of negative examples into the ConvNet (instead of only one negative example), to improve the loss learning procedure and to design a new stream-loss function. Introducing a new noise layer into the proposed ConvNet: incorporating a modification signal into the stream-loss function to capture the statistics of label noise and adapt the network to the noisy nature of the generated dataset. Steps: train the ConvNet to clean noisy annotations in the large dataset (e.g., FaceSequence) using clean labels from the same domain, then fine-tune the network using both the clean labels and the full dataset with reduced noise.

Thank You! Elaheh Rashedi [email protected]