YOLO_review.pptx

YOLO object detection review

This document aims to review the popular object detection algorithm YOLO.

0. Background: from CV tasks to object detection (YOLO)

Object detection: locate bounding boxes and classify them. Input: image feature maps or hand-crafted features. Output: object bounding boxes and labels. Methods: one-stage and two-stage detectors, anchor-free detectors, etc.

Semantic segmentation: classify each pixel semantically. Input: image feature maps. Output: per-pixel segmentation labels. Methods: FCN, dilated convolutions, etc.

Instance segmentation: distinguish different instances within each category. Input: preliminary object detection results. Output: per-pixel segmentation masks. Methods: Mask R-CNN, etc.

YOLO: this document mainly illustrates YOLOv1. The paper addresses the inference-speed problem of two-stage detection and finds a good balance between inference speed and accuracy.

Abstract

Background: prior work on object detection repurposes classifiers to perform detection.

Method: this paper proposes a method that frames object detection as a regression problem, mapping directly to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance, which makes it very fast.

Result: the architecture processes images in real time at 45 frames per second and outperforms other frameworks such as DPM and R-CNN.

1. Introduction

What is the repurposed-classifier approach to object detection?
One-stage (sliding-window) method: DPM runs a classifier at evenly spaced locations over the entire image.
Two-stage method: R-CNN combines a region-proposal stage with a classification network; it achieves high accuracy and robustness, but inference is slow.

What do current detectors do? For example, R-CNN uses a region-proposal method to generate potential bounding boxes in the image and runs a classifier on each proposed box; post-processing is then used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene.

What does YOLO do? YOLO reframes object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities; a single convolutional network simultaneously predicts multiple bounding boxes and the class probabilities for those boxes.

What are YOLO's properties? First, it is fast; second, it reasons globally about the image when making predictions; third, it learns generalizable representations of objects.

2. Unified Detection

Unify: previous object detectors use sliding-window and region-proposal-based techniques; YOLO unifies the separate components of object detection into a single neural network.

Image-to-grid mapping: the system divides the input image into an S x S grid; the grid cell containing the center of an object is responsible for detecting that object.

Detection: each grid cell predicts B bounding boxes and a confidence score for each box (the confidence that the box contains an object), Pr(Object) * IOU(pred, truth). Each bounding box consists of 5 predictions: x, y, w, h, and confidence, where (x, y) is the box center relative to the grid cell and (w, h) are the box width and height relative to the whole image. Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object): one set of class probabilities per grid cell, regardless of the number of boxes B.

Unified detection score: at test time the conditional class probabilities are multiplied by the individual box confidences, Pr(Class_i | Object) * Pr(Object) * IOU(pred, truth) = Pr(Class_i) * IOU(pred, truth), giving class-specific confidence scores for each box: the probability of that class appearing in the box times how well the predicted box fits the object.
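To make the grid encoding concrete, here is a minimal decoding sketch (my own illustration, not the paper's code), assuming S=7, B=2, C=20 and a per-cell layout of [x, y, w, h, confidence] for each box followed by the C class probabilities:

```python
import numpy as np

# Decode a YOLOv1-style output tensor into per-box class-specific confidence scores.
# Layout assumption: per cell, [x, y, w, h, conf] * B, then C class probabilities.
S, B, C = 7, 2, 20

def decode(pred):                      # pred: (S, S, B*5 + C)
    boxes, scores = [], []
    for i in range(S):                 # grid row
        for j in range(S):             # grid column
            cell = pred[i, j]
            class_probs = cell[B * 5:]             # Pr(Class_i | Object), shape (C,)
            for b in range(B):
                x, y, w, h, conf = cell[b * 5: b * 5 + 5]
                # (x, y) are offsets within the cell; convert to image-relative coords
                cx = (j + x) / S
                cy = (i + y) / S
                boxes.append([cx, cy, w, h])
                # class-specific confidence = Pr(Class_i | Object) * Pr(Object) * IOU
                scores.append(conf * class_probs)
    return np.array(boxes), np.array(scores)       # (S*S*B, 4), (S*S*B, C)

boxes, scores = decode(np.random.rand(S, S, B * 5 + C))
print(boxes.shape, scores.shape)       # (98, 4) (98, 20)
```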

2.1. Design

Network architecture: the network is inspired by the GoogLeNet model for image classification. It has 24 convolutional layers followed by 2 fully connected layers, with 1x1 reduction layers followed by 3x3 convolutional layers. The output is a 7x7x30 tensor of predictions.
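The full 24-layer configuration is too long to list on a slide, but a greatly simplified PyTorch sketch (my own illustration, not the paper's exact layer counts or channel widths) shows the overall pattern: strided downsampling from a 448x448 input to a 7x7 feature map, 1x1 reduction layers before 3x3 convolutions, and two fully connected layers producing the 7*7*30 tensor.

```python
import torch
import torch.nn as nn

# Greatly simplified YOLOv1-style backbone (not the paper's exact 24-layer config).
def conv(in_ch, out_ch, k, s=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride=s, padding=k // 2),
        nn.LeakyReLU(0.1),   # leaky rectified linear activation used by YOLO
    )

class TinyYOLO(nn.Module):
    def __init__(self, S=7, B=2, C=20):
        super().__init__()
        self.features = nn.Sequential(
            conv(3, 64, 7, s=2), nn.MaxPool2d(2),                    # 448 -> 112
            conv(64, 192, 3),    nn.MaxPool2d(2),                    # 112 -> 56
            conv(192, 128, 1), conv(128, 256, 3),                    # 1x1 reduction, then 3x3
            conv(256, 256, 1), conv(256, 512, 3), nn.MaxPool2d(2),   # 56 -> 28
            conv(512, 256, 1), conv(256, 512, 3), nn.MaxPool2d(2),   # 28 -> 14
            conv(512, 512, 1), conv(512, 1024, 3), nn.MaxPool2d(2),  # 14 -> 7
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(1024 * S * S, 4096), nn.LeakyReLU(0.1),
            nn.Linear(4096, S * S * (B * 5 + C)),                    # 7*7*30 predictions
        )
        self.S, self.B, self.C = S, B, C

    def forward(self, x):
        x = self.head(self.features(x))
        return x.view(-1, self.S, self.S, self.B * 5 + self.C)

out = TinyYOLO()(torch.randn(1, 3, 448, 448))
print(out.shape)   # torch.Size([1, 7, 7, 30])
```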

2.2. Training

Pretraining: the first 20 convolutional layers, followed by an average-pooling layer and a fully connected layer, were pretrained on the ImageNet 1000-class competition dataset for approximately a week, reaching a single-crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe's Model Zoo.

Converting to detection: prior work showed that adding both convolutional and connected layers to pretrained networks can improve performance, so the paper adds four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information, so the input resolution of the network is increased from 224x224 to 448x448.

Targets: the final layer predicts both class probabilities and bounding box coordinates. The bounding box width and height are normalized by the image width and height so that they fall between 0 and 1, and the bounding box x and y coordinates are parametrized as offsets of a particular grid cell location so they are also bounded between 0 and 1.
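A minimal sketch of this target parametrization (my own illustration; encode_box is a hypothetical helper, not from the paper):

```python
# Box width/height are normalized by the image size and the center is expressed
# as an offset inside its grid cell, so all four values lie in [0, 1].
def encode_box(box, img_w, img_h, S=7):
    """box = (x_min, y_min, x_max, y_max) in pixels -> (cell_row, cell_col, x, y, w, h)."""
    cx = (box[0] + box[2]) / 2.0 / img_w      # center x, relative to image
    cy = (box[1] + box[3]) / 2.0 / img_h      # center y, relative to image
    w = (box[2] - box[0]) / img_w             # width relative to image
    h = (box[3] - box[1]) / img_h             # height relative to image
    col, row = int(cx * S), int(cy * S)       # grid cell responsible for this object
    x = cx * S - col                          # offset of the center inside the cell
    y = cy * S - row
    return row, col, x, y, w, h

# Example: a 100x150 object centered at (224, 224) in a 448x448 image
print(encode_box((174, 149, 274, 299), 448, 448))   # cell (3, 3), offsets (0.5, 0.5)
```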

2.2.1. Training loss

Activation: a linear activation function is used for the final layer; all other layers use the leaky rectified linear activation phi(x) = x if x > 0, and 0.1x otherwise.

Objective: the model is optimized for sum-squared error in the output. Sum-squared error is used because it is easy to optimize, but it does not perfectly align with the goal of maximizing average precision. To compensate, the loss from bounding box coordinate predictions is increased and the loss from confidence predictions for boxes that don't contain objects is decreased, using two parameters set to lambda_coord = 5 and lambda_noobj = 0.5.

Box size: sum-squared error also weights errors in large boxes and small boxes equally, while the error metric should reflect that small deviations in large boxes matter less than in small boxes; to partially address this, the loss is computed on the square root of the bounding box width and height.

Responsibility: YOLO predicts multiple bounding boxes per grid cell. At training time, only one bounding box predictor should be responsible for each object, so the predictor whose prediction has the highest current IOU with the ground truth is assigned as "responsible" for that object. This leads to specialization between the bounding box predictors: each predictor gets better at predicting certain sizes, aspect ratios, or classes of objects, improving overall recall.
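Putting the pieces together, here is a rough single-image sketch of the weighted sum-squared loss (my own illustration, not the authors' implementation; it assumes the responsibility assignment has already been computed and that predicted widths and heights are nonnegative):

```python
import numpy as np

# pred/target: (S, S, B*5 + C); obj_mask marks cells containing an object;
# resp[i, j] is the index of the "responsible" predictor (highest IOU) in that cell.
def yolo_loss(pred, target, obj_mask, resp, S=7, B=2, C=20,
              lambda_coord=5.0, lambda_noobj=0.5):
    loss = 0.0
    for i in range(S):
        for j in range(S):
            p, t = pred[i, j], target[i, j]
            for b in range(B):
                px, py, pw, ph, pc = p[b * 5: b * 5 + 5]
                tx, ty, tw, th, tc = t[b * 5: b * 5 + 5]
                if obj_mask[i, j] and resp[i, j] == b:
                    # coordinate loss, with sqrt on w/h so small boxes weigh more
                    loss += lambda_coord * ((px - tx) ** 2 + (py - ty) ** 2)
                    loss += lambda_coord * ((np.sqrt(pw) - np.sqrt(tw)) ** 2
                                            + (np.sqrt(ph) - np.sqrt(th)) ** 2)
                    loss += (pc - tc) ** 2                 # confidence, object present
                else:
                    loss += lambda_noobj * (pc - tc) ** 2  # confidence, no object
            if obj_mask[i, j]:
                # one set of class probabilities per cell
                loss += np.sum((p[B * 5:] - t[B * 5:]) ** 2)
    return loss
```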

2.3. Inference

Just like training, predicting detections for a test image only requires one network evaluation. On VOC the network predicts 98 bounding boxes per image and class probabilities for each box. YOLO is extremely fast at test time since it only requires a single network evaluation, unlike classifier-based methods.

The grid design enforces spatial diversity in the bounding box predictions. Often it is clear which grid cell an object falls into, and the network only predicts one box for each object. However, some large objects, or objects near the border of multiple cells, can be well localized by multiple cells. Non-maximal suppression can be used to fix these multiple detections. While not critical to performance as it is for R-CNN or DPM, non-maximal suppression adds 2-3% in mAP.
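A minimal sketch of the non-maximal suppression step (standard greedy NMS; my own illustration, not tied to the paper's code):

```python
import numpy as np

# Keep the highest-scoring box, drop any remaining box that overlaps it by more
# than iou_thresh, and repeat. Boxes are (x_min, y_min, x_max, y_max).
def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.5):
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    while len(order):
        best, order = order[0], order[1:]
        keep.append(int(best))
        order = np.array([k for k in order
                          if iou(boxes[best], boxes[k]) < iou_thresh])
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # [0, 2]: the second box overlaps the first and is suppressed
```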

2.4. Limitations of YOLO

YOLO imposes strong spatial constraints on bounding box predictions: each grid cell only predicts two boxes and can only have one class, so this constraint limits the number of nearby objects the model can predict. The model struggles with small objects that appear in groups, such as flocks of birds.

Since the model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations; it also uses relatively coarse features for predicting bounding boxes, since the architecture has multiple downsampling layers from the input image.

Finally, while training uses a loss function that approximates detection performance, that loss treats errors the same in small bounding boxes and large bounding boxes. A small error in a large box is generally benign, but a small error in a small box has a much greater effect on IOU. See the worked example below.
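A quick worked example of that last point (my own numbers, not from the paper):

```python
# Shifting a box by 5 pixels barely changes the IOU of a 200x200 box
# but sharply reduces it for a 20x20 box.
def iou_after_shift(size, shift):
    overlap = max(0, size - shift)          # 1-D overlap after shifting along one axis
    inter = overlap * size                  # intersection area of the two boxes
    union = 2 * size * size - inter         # union of the two equal-sized boxes
    return inter / union

print(round(iou_after_shift(200, 5), 3))    # 0.951 -> large box, small IOU penalty
print(round(iou_after_shift(20, 5), 3))     # 0.6   -> small box, large IOU penalty
```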

3. Comparison to Other Detection Systems

Object detection is a core problem in computer vision. Detection pipelines generally start by extracting a set of robust features from the input image; classifiers or localizers are then used to identify objects in the feature space. These classifiers or localizers are run either in a sliding-window fashion over the whole image or on some subset of regions in the image. The paper compares the YOLO detection system to several top detection frameworks, highlighting key similarities and differences.