“Stream loss”: ConvNet learning for face verification using unlabeled videos in the wild
Elaheh Rashedi
About This Presentation
Face recognition tasks have seen significantly improved performance due to ConvNets. However, less attention has been given to face verification from videos. This paper makes two contributions along these lines. First, we propose a method, called stream loss, for learning ConvNets using unlabeled videos in the wild. Second, we present an approach for generating a face verification dataset from videos in which labeled streams can be created automatically, without human annotation. Using this approach, we have assembled a widely scalable dataset, FaceSequence, which includes 1.5M streams capturing ∼500K individuals. Using this dataset, we trained our network to minimize the stream loss. The network achieves accuracy comparable to the state of the art on the LFW and YTF datasets with much smaller model complexity. We also fine-tuned the network using the IJB-A dataset. The validation results show competitive accuracy compared with the best previous video face verification results.
Size: 4.1 MB
Language: en
Added: Nov 27, 2018
Slides: 45 pages
Slide Content
Learning Convolutional Neural Network for Face Verification Presented By: Elaheh Rashedi PhD in Computer Science Wayne State University 2018 Advisor: Professor Xue-wen Chen
Contents Introduction Long-Term Face Tracking using ConvNet “FaceSequence”: Video dataset for Face Recognition “Stream-Loss”: ConvNet Learning for Face Verification Conclusion & Future Work
Introduction Background Convolutional Neural Network (ConvNet) ConvNet-based Face Verification ConvNet-based Face Recognition Models Two-step verification model Single-step verification model Train and Test dataset Challenges Our Contributions
Convolutional Neural Network A kind of neural network whose input is an image Uses sparser connectivity between neurons than a fully connected network Fig 1. General ConvNet structure in face recognition problems
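To make the sparse-connectivity point concrete, here is a minimal 1-D sketch (illustrative pure Python, not from the thesis): every output value reuses the same small kernel, so a convolutional layer needs far fewer weights than a fully connected one.

```python
def conv1d(signal, kernel):
    """Valid 1-D convolution (cross-correlation, as in most deep
    learning libraries): the same small kernel slides over the input."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# An edge-detecting kernel responds where the signal changes.
out = conv1d([0, 0, 1, 1, 1, 0], [1, -1])   # -> [0, -1, 0, 0, 1]
```

The same idea extends to 2-D kernels sliding over face images, which is what the ConvNet layers in Fig 1 compute.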
ConvNet-based Face Verification Common steps: Face detection Viola-Jones, Cascade CNN, … Pre-processing Geometric & lighting normalization ConvNet training Supervised vs. unsupervised Face identification Classification task Metric learning Joint-Bayesian, Cosine similarity, Triplet Similarity, Energy-based similarity, … Face Verification
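Among the metric-learning options listed above, cosine similarity is the simplest to illustrate. A minimal sketch in pure Python (the toy feature vectors and the 0.75 decision threshold, which this talk uses later for tracking, are illustrative assumptions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def verify(feat_a, feat_b, threshold=0.75):
    """Declare 'same identity' when the similarity exceeds the threshold."""
    return cosine_similarity(feat_a, feat_b) >= threshold

# Toy vectors; real ones would come from the ConvNet's embedding layer.
same = verify([0.9, 0.1, 0.4], [0.8, 0.2, 0.5])   # nearly parallel -> True
diff = verify([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])   # orthogonal -> False
```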
ConvNet-based Face Verification Models Two-step verification models Two frameworks for identification and verification DeepFace model Web-scaled DeepFace model DeepID model series Single-step verification models Same framework for identification and verification FaceNet model (GoogleNet) VGG model
Train and Test dataset Table 1. The common face recognition datasets
Challenges Few trainable video-based Convolutional Neural Network models have been proposed Lack of an available public video training dataset Existing long-term face tracking algorithms have low accuracy Face tracking algorithms can be utilized to collect the training video dataset
Our Contributions Designing a video-based face verification model using ConvNet
Contents Introduction Long-Term Face Tracking using ConvNet “FaceSequence”: Video dataset for Face Recognition “Stream-Loss”: ConvNet Learning for Face Verification Conclusion & Future Work
Long-Term Face Tracking using ConvNet Common Long Term Tracking Algorithms Tracking Challenges Proposed Model Detection-Verification-Tracking model (DVT) Deep-Learning-based Face Detection ConvNet-based Face Verification Multi-patch based Face Tracking DVT System Framework Demonstration Results
Common Long Term Tracking Algorithms Common Tracking Steps Select a video Employ a bounding box around the target Distinguish the object from the background Track the object around the same region in next frame Fig 2. Tracking schema using bounding box
Tracking Challenges Can be challenging on real world noisy videos Not robust against Appearance changes Occlusion Fast motion Illumination changes Background clutter Sensitive to the initialization of target Not able to handle all situations Long term tracking challenge: Not reliable in cases where the object leaves the view
Detection-Verification-Tracking model (DVT) Model Detection-Verification-Tracking Goal Long term face tracking Wild video target (unconstrained environment) Includes 3 components: Deep learning based face detection ConvNet-based face verification Multi-patch based tracking
Deep-Learning-based Face Detection Model Cascade-CNN ( ConvNet -based detection model) ConvNet structure: 3 ConvNets for faces vs. non-faces (binary classification) 3 ConvNets for bounding box calibration (Multiclass classification) Fig 3. Cascade-CNN face detection for binary classification
ConvNet-based Face Verification Pre-trained network based on VGG MatConvNet Convolutional Neural Network: 37 layers Feature vector dimension: 4096 Fig 4. Proposed Verification steps
Multi-patch based Face Tracking Employs Multiple patches around the target Categorize patches to reliable/non-reliable categories Track reliable patches Ignore non-reliable patches Result is the average of reliable patches Fig 5. Multi-Patch tracking
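The averaging step above can be sketched as follows (pure Python; the patch representation and reliability scores are illustrative assumptions, not the thesis implementation):

```python
def track_estimate(patches, reliability_threshold=0.5):
    """Estimate the target position from multiple patches.

    Each patch is (x, y, reliability). Unreliable patches are ignored,
    and the result is the average position of the reliable ones.
    """
    reliable = [(x, y) for x, y, r in patches if r >= reliability_threshold]
    if not reliable:
        return None  # target lost: hand control back to the detector
    n = len(reliable)
    return (sum(x for x, _ in reliable) / n,
            sum(y for _, y in reliable) / n)

# Three reliable patches and one low-reliability outlier that is discarded.
pos = track_estimate([(10, 10, 0.9), (12, 10, 0.8), (11, 12, 0.7), (40, 5, 0.1)])
# pos is approximately (11.0, 10.67)
```

Discarding unreliable patches before averaging is what keeps the estimate stable under partial occlusion of the face.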
DVT System Framework Fig 6. Flowchart of the proposed Long-term face tracking method, DVT
Demonstration Fig 7. Demonstration of the DVT system for pausing the video and selecting the target face to be tracked
Fig 13. Demonstration of the DVT tracking results
Fig 8. An example of DVT output sequence.
Results Implemented in Matlab R2015b with MatConvNet GUI implemented in Java Similarity threshold: 0.75 Skip time: 3 s Running time: ~2x the video duration Table 2. Comparison between TLD, Face-TLD, and the proposed DVT method in terms of precision and recall on the sitcom IT-Crowd (first series, first episode).
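As a reminder of what Table 2 reports, precision and recall can be computed from per-frame tracking outcomes like this (the counts below are illustrative, not the thesis numbers):

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN).

    For tracking: TP = frames where the target is tracked correctly,
    FP = frames where a wrong region is reported as the target,
    FN = frames where the visible target is missed.
    """
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall

# e.g. 90 correctly tracked frames, 10 spurious tracks, 30 missed frames
p, r = precision_recall(90, 10, 30)   # p = 0.9, r = 0.75
```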
Contents Introduction Long-Term Face Tracking using ConvNet “FaceSequence”: Video dataset for Face Recognition “Stream-Loss”: ConvNet Learning for Face Verification Conclusion & Future Work
“FaceSequence”: Video dataset for Face Recognition Stream Collection & Labeling A highly automated strategy Based on long-term face tracking Using noisy videos collected from web FaceSequence Statistics Stream Samples FaceSequence Advantages
Stream Collection & Labeling Steps: Video Collection Face videos are collected from the web Videos are curated to control biases in: ethnicity, gender, age Original target selection Employing a face detection algorithm Negative sample selection Detecting other faces from the same frame as the target Positive sample stream selection Deploying the face tracking algorithm Tracking for a specific time period
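The selection steps above can be sketched as a loop over per-frame detections (pure Python; the face identifiers stand in for the Cascade-CNN detections and the DVT tracker, which are not reproduced here):

```python
def collect_streams(frames, stream_len=25):
    """Build labeled samples from per-frame face detections.

    `frames` is a list of detection lists; each detection is a face id
    (a stand-in for a bounding box + crop). The first face in the first
    frame is the original target, the other faces in that frame are the
    negatives, and the target tracked over up to `stream_len` frames
    forms the positive stream -- all labeled without human annotation.
    """
    first = frames[0]
    target = first[0]                    # original target selection
    negatives = list(first[1:])          # other faces in the same frame
    stream = []
    for frame in frames[:stream_len]:    # track target for a fixed period
        if target in frame:
            stream.append(target)
        else:
            break                        # target left the view
    return target, negatives, stream

frames = [["alice", "bob"], ["alice"], ["alice", "carol"], ["dave"]]
t, negs, stream = collect_streams(frames)
# t == "alice", negs == ["bob"], stream spans the 3 frames containing "alice"
```

In FaceSequence the same roles are played by tracked face crops, which is why no human enters the labeling loop.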
FaceSequence Statistics Table 3. Characteristics of FaceSequence dataset, including total number of collected videos, number of streams extracted from videos, and number of frames per each stream.
Stream Samples Fig 9. A sample of streams of frames available in FaceSequence dataset for 5 identities.
FaceSequence Advantages Contains streams of frames Extracted from noisy videos Retains higher similarity per subject Streams are stills from 1 second of video Images in a stream are more similar in terms of: background, lighting, resolution Widely scalable: In public face datasets, labeled celebrity photos are crawled from the web, making it challenging to assemble millions of individuals In private face datasets, human annotators are involved to expand the data, which is costly and time consuming In FaceSequence, streams are labeled automatically, with no human in the labeling loop, so the dataset is expandable
Contents Introduction Long-Term Face Tracking using ConvNet “FaceSequence”: Video dataset for Face Recognition “Stream-Loss”: ConvNet Learning for Face Verification Conclusion & Future Work
“Stream-Loss”: ConvNet Learning for Face Verification ConvNet-based Face Recognition Methods Loss Learning Approaches Video-based ConvNet Models Stream-based ConvNet Learning Method Proposed Architecture Design Stream Loss Learning Experimental Results LFW and YTF Datasets IJB-A Video Dataset
Loss Learning Approaches Contrastive loss Based on the distance between two objects Triplet loss Based on the distance between three objects Multiple loss Based on the distance between multiple objects Fig 10. Triplet loss Fig 11. Multiple loss
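The triplet case above is worth spelling out, since the stream loss later generalizes it. A minimal sketch in pure Python (the margin value is an illustrative assumption):

```python
def l2(a, b):
    """Euclidean (L2) distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero when the positive is closer to the anchor than the
    negative by at least `margin`; otherwise a hinge penalty."""
    return max(0.0, l2(anchor, positive) - l2(anchor, negative) + margin)

# Positive already far closer than the negative -> loss is 0.
easy = triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 1.0])
# Positive farther than the negative -> positive loss drives learning.
hard = triplet_loss([0.0, 0.0], [1.0, 0.0], [0.5, 0.0])
```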
Video-based ConvNet Models Adapting an image-based ConvNet to videos Mapping each face image into a single feature vector Using a ConvNet Aggregating all feature vectors into a single one Using an aggregation function (average, max, …) Examples: Neural Aggregation Network (NAN) Input aggregated Network Pros: Simple architecture Cons: Each frame is treated like a still image The temporal relation between frames is ignored Fig 12. NAN architecture for video face recognition.
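The simplest aggregation function mentioned above, average pooling, looks like this (pure Python sketch; NAN instead learns attention weights for the average, which is not shown here):

```python
def aggregate(features):
    """Average-pool per-frame feature vectors into one video descriptor.

    Each frame's vector contributes equally -- this is exactly where the
    temporal ordering of the frames is thrown away.
    """
    n = len(features)
    dim = len(features[0])
    return [sum(f[i] for f in features) / n for i in range(dim)]

video_descriptor = aggregate([[1.0, 2.0], [3.0, 4.0]])   # [2.0, 3.0]
```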
Stream-based ConvNet Learning Method We proposed: A novel video-based ConvNet architecture Training inputs are streams of frames (rather than single images) Designed for face verification A video-based loss learning approach Named stream-loss
Proposed Architecture Design Fig 13. The architecture of the stream-based ConvNet for video face recognition.
Proposed Architecture Design (cont…) Fig 14. Proposed flowchart for face verification.
Stream Loss Learning Set of input images; set of output feature vectors; L2-norm distance between pairs of y_o, y_p, y_n (the original target, positive, and negative features)
Stream Loss Learning (cont…) smooth-max and smooth-min function
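The formulas on this slide were images and are lost here. A plausible reconstruction, assuming the common log-sum-exp (softmax-weighted) form of smooth max/min with a sharpness parameter $\beta > 0$ over distances $d_1,\dots,d_k$:

```latex
\operatorname{smoothmax}_\beta(d_1,\dots,d_k) = \frac{1}{\beta}\log\sum_{i=1}^{k} e^{\beta d_i},
\qquad
\operatorname{smoothmin}_\beta(d_1,\dots,d_k) = -\frac{1}{\beta}\log\sum_{i=1}^{k} e^{-\beta d_i}
```

Both are differentiable everywhere, unlike hard max/min, and approach them as $\beta \to \infty$; this is what makes them usable inside a ConvNet loss.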
Stream Loss Learning (cont…) Stream-loss function
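The stream-loss formula itself was also an image. The sketch below is a reconstruction under stated assumptions, not the thesis equation: a triplet-style hinge where the single positive distance is replaced by a smooth-max over the distances to the whole positive stream (smooth-min is included because the slides name it; it would apply symmetrically to a stream of negatives, which the Future Work slide proposes). The margin and β values are illustrative.

```python
import math

def l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def smooth_max(vals, beta=10.0):
    """Differentiable upper bound approximating max(vals)."""
    return math.log(sum(math.exp(beta * v) for v in vals)) / beta

def smooth_min(vals, beta=10.0):
    """Differentiable lower bound approximating min(vals)."""
    return -math.log(sum(math.exp(-beta * v) for v in vals)) / beta

def stream_loss(y_o, y_p_stream, y_n, margin=0.2, beta=10.0):
    """Hinge on the worst (smooth-max) positive-stream distance vs. the
    distance to the negative: every frame of the positive stream must
    end up closer to the target than the negative, by `margin`."""
    worst_pos = smooth_max([l2(y_o, y_p) for y_p in y_p_stream], beta)
    return max(0.0, worst_pos - l2(y_o, y_n) + margin)

# Tight positive stream, distant negative -> the hinge is inactive.
loss = stream_loss([0.0, 0.0], [[0.1, 0.0], [0.0, 0.2]], [1.0, 1.0])
```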
Experimental Results Train the network on the FaceSequence dataset Test on the LFW and YTF datasets: LFW includes 13,233 images from 5,749 different identities; YTF includes 3,425 videos from 1,595 different identities Test on the IJB-A video dataset: 5,397 images and 2,042 videos from 500 identities
Experiments on LFW and YTF Datasets Table 4. Comparison of Verification Performance of Different Methods on the LFW and YTF Datasets.
Experiments on IJB-A Video Dataset Table 5. Performance comparison on the IJB-A dataset. TAR/FAR: True/False Acceptance Rate for verification. The TAR of our method at FAR=0.01 reduces the error of VGG by 67%, which demonstrates a significant improvement.
Contents Introduction Long-Term Face Tracking using ConvNet “FaceSequence”: Video dataset for Face Recognition “Stream-Loss”: ConvNet Learning for Face Verification Conclusion & Future Work
Conclusion
Future Work Feeding streams of negative examples into the ConvNet (instead of only one negative example): Improve the loss learning procedure Design a new stream-loss function Introducing a new noise layer into the proposed ConvNet: Incorporating a modification signal into the stream-loss function to calculate the statistics of label noise Adapting the network to the noisy nature of the generated dataset Steps: Train the ConvNet to clean noisy annotations in the large dataset (e.g. FaceSequence) using clean labels from the same domain Fine-tune the network using both the clean labels and the full dataset with reduced noise.