Weapon Detection Using Deep Learning: A
Comparative Study of YOLO Architectures
by
Examination Roll: 241133
A Project Report submitted to the
Institute of Information Technology
in partial fulfillment of the requirements for the degree of
Professional Masters in Information Technology
Supervisor: Dr. M. Mesbahuddin Sarker
Institute of Information Technology
Jahangirnagar University
Savar, Dhaka-1342
October 2025

DECLARATION
I hereby declare that this thesis is based on results obtained from my own implementation and analysis. Materials from the work of other researchers are appropriately cited. This thesis, neither in whole nor in part, has been previously submitted for any degree.
Roll: 241133

CERTIFICATE
The project titled “Weapon Detection Using Deep Learning: A Comparative Study of YOLO Architectures”, submitted by Abdulla Al Mamun, ID: 241133, Session: Spring 2024, has been accepted as satisfactory in partial fulfillment of the requirement for the degree of Professional Master’s in Information Technology on the 11th of October 2025.
Dr. M. Mesbahuddin Sarker
Supervisor

BOARD OF EXAMINERS
Dr. M. Shamim Kaiser, Professor, IIT, JU
Coordinator, PMIT Coordination Committee

Dr. Risala Tasin Khan, Professor, IIT, JU & Director, IIT
Member, PMIT Coordination Committee

Dr. Jesmin Akhter, Associate Professor, IIT, JU
Member, PMIT Coordination Committee

K M Akkas Ali, Associate Professor, IIT, JU
Member, PMIT Coordination Committee

Dr. Rashed Mazumder, Associate Professor, IIT, JU
Member, PMIT Coordination Committee

ACKNOWLEDGEMENTS
All praise be to Almighty Allah, who has given me energy, patience, and courage throughout this work. My special thanks go to my supervisor, Dr. M. Mesbahuddin Sarker, for his supervision, helpful comments, and encouragement at every step of this study. I would like to express my gratitude to the teachers and administrative members of the Institute of Information Technology (IIT), Jahangirnagar University, for providing a conducive academic and research environment. I would also like to thank the open-source community and the producers of public datasets whose resources were essential for this work. Special thanks are due to my family and friends, who gave me continuous moral support and patience during difficult times.
I am also grateful to my fellow students for their cooperation and the valuable academic discussions that expanded my knowledge during this project. I am grateful to the technical staff at IIT for their valuable support in providing tools, systems, and lab facilities. I must also thank the staff of the university library for their assistance in making necessary materials and sources available. The lecture series, workshops, and academic courses that IIT arranged were of great importance in developing my research approach and skills. I also thank the peer reviewers and other researchers who provided useful comments that furthered the development and quality of this work. Most importantly, none of this would have happened if IIT were not a place where cooperation, mentoring, and learning are the rule, not the exception.
— Abdulla Al Mamun

ABSTRACT
Weapon-related incidents, particularly those involving firearms and bladed weapons
such as guns and knives, represent a pressing challenge for public safety. Manual
monitoring of surveillance streams is limited by fatigue and latency, while traditional
feature-engineered systems exhibit high false alarms under clutter, occlusion, and
low illumination. One-stage detectors such as the YOLO family offer a compelling
alternative by balancing accuracy and real-time speed.
This study presents a unified comparison of four detectors—YOLOv8n, YOLOv8s,
YOLOv8m, and YOLOv5s—on a curated Guns–Knives dataset with bounding-box
annotations. We standardize preprocessing (640×640), augmentation (flips, HSV jit-
ter, mosaic), and hyperparameters (20 epochs, patience 3) for fair evaluation. Metrics
include mAP@0.5, mAP@0.5:0.95, precision, and recall. YOLOv8m yields the best
accuracy, while YOLOv8n achieves the highest efficiency for edge devices, highlighting
the practical trade-off between accuracy and compute for real-time CCTV.
Keywords:

LIST OF ABBREVIATIONS
YOLO You Only Look Once
CNN Convolutional Neural Network
mAP Mean Average Precision
TPR True Positive Rate (Recall/Sensitivity)
FPR False Positive Rate
TP True Positive
TN True Negative
FP False Positive
FN False Negative
PR Precision–Recall
IoU Intersection over Union

LIST OF NOTATIONS
X_i is the i-th input sample (image)
y_i is the ground-truth class label of X_i (gun or knife)
ŷ_i is the predicted class label of X_i
P is precision: P = TP / (TP + FP)
R is recall: R = TP / (TP + FN)
F1 is the F1-score: F1 = 2PR / (P + R)
mAP is the mean average precision (mean area under the per-class precision–recall curves)

LIST OF FIGURES
Figure
3.1 Class distribution (training). Validation and test follow similar trends.
3.2 Class distribution (validation split).
3.3 Class distribution (test split).
3.4 Split sizes (number of images).
4.1 End-to-end study pipeline.

LIST OF TABLES
Table
4.1 Leaderboard of YOLO models on the Guns–Knives dataset.

TABLE OF CONTENTS
DECLARATION
CERTIFICATE
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF ABBREVIATIONS
LIST OF NOTATIONS
LIST OF FIGURES
LIST OF TABLES
CHAPTER
I. Introduction
1.1 Motivation
1.2 Research Questions
1.3 Objectives
1.3.1 Dataset Standardization
1.3.2 Comparative Benchmarking
1.3.3 Efficiency Evaluation
1.3.4 Error Analysis
1.4 Research Goal
1.5 Paper Organization
II. Literature Review
2.1 Weapon Detection with Deep Learning
2.2 YOLO for Weapon Detection
2.3 YOLOv4 and Weapon Detection
2.4 Faster R-CNN vs YOLO
2.5 SSD and MobileNet for Embedded Devices
2.6 YOLOv4 vs YOLOv5 and YOLOv8
2.7 Summary
2.7.1 YOLO Architectures for Weapon Detection
2.7.2 Faster R-CNN and SSD Models
2.7.3 Accuracy vs Efficiency
2.7.4 Edge Deployment and Lightweight Models
2.7.5 Unified Benchmarking and Implementation
III. Methodology
3.1 Dataset and Splits
3.2 Preprocessing and Augmentation
3.2.1 Horizontal Flips
3.2.2 HSV Jitter
3.2.3 Mosaic
3.3 Models
3.4 Training Configuration
3.5 Evaluation Metrics
IV. Results and Discussion
4.1 Results
4.1.1 Precision
4.2 Confusions and Error Shape
4.2.1 Small Knives
4.2.2 Obstructed Guns
4.2.3 False Positives and Negative Samples
4.3 Qualitative Results
V. Performance Analysis
5.1 Performance Evaluation
5.2 Comparative Analysis
5.3 Comparison of Accuracy vs Latency
VI. Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
References

CHAPTER I
Introduction
1.1 Motivation
With the rapid progress of deep learning in recent years, visual recognition tasks have advanced significantly, and many areas, including weapon detection, are being transformed. One of the most important applications in today's society is real-time weapon detection for public safety. Conventional weapon detection technologies, such as manual surveillance by human observers or rule-based systems, have several major drawbacks. These systems are often subject to operator fatigue, slow response times, and high false alarm rates caused by problems such as visual clutter, occluded objects, and poor illumination in real-world environments like video surveillance systems [1].
The use of weapons (gunshots, stabbings, assaults) can have devastating effects, and the early discovery of a weapon threat is paramount for minimizing the cost to human life and health and improving public safety outcomes. Consequently, there is a growing need to develop automated weapon detection systems capable of operating in real time, enabling rapid response and prompt alerts to security personnel [2]. Applying deep learning to the task of weapon detection offers a tremendous advantage over conventional approaches, since it greatly shortens the period in which potential threats are detected and intercepted. This is particularly useful in settings where human oversight is challenging because of the sheer number of video streams or other practical considerations.

With the introduction of Convolutional Neural Networks (CNNs), and especially one-stage object detection models such as YOLO, both the performance and the real-time requirements of weapon recognition have been satisfied well. YOLO models are popular for their ability to combine feature extraction and classification in a single, unified network that is quick and accurate on images. While classic two-stage detectors first propose regions of interest and then classify them, YOLO does both simultaneously, which makes it an order of magnitude faster. This has rendered YOLO well suited to many real-time systems, such as video surveillance, where every millisecond can make a difference [3].
YOLO has since become a widely adopted computer vision approach: it provides real-time inference with competitive accuracy. This is particularly significant for surveillance applications, which must operate continuously under changing conditions with massive variations in illumination, background clutter, and object occlusion [4]. In such scenarios, fast and precise weapon detection can go a long way towards alleviating threats and enabling quick response. When YOLO-based models run on the edge, on cameras, drones, or IoT devices, computation is limited to some extent. Such devices typically have scarce computing resources and therefore require lightweight models that are both accurate and low-latency. Models such as YOLOv8n are suited to these settings, performing well without an excessive loss of accuracy. On the server side, where resources (CPU and memory) are sufficient, larger models such as YOLOv8m can be utilized, leading to improved performance at the expense of some latency [5]. The trade-off between accuracy and latency is important in both cases, and different mission scenarios favor one deployment or the other depending on the situation of operation. Server-side systems, which do not always need real-time processing, may employ more sophisticated and precise models, while edge deployments must guarantee low latency to counter and respond to weapons as quickly as possible [6]. The emergence of YOLO and similar deep learning models is a paradigm shift in weapons detection, giving us faster, more accurate, and more scalable methods than conventional means. However, there is still room to address the considerable loss of detection performance under complex input conditions such as small objects, occlusion, and poor illumination.

1.2 Research Questions
The study is guided by the following research questions:

1. What is the impact of different YOLO architectures (YOLOv8n, YOLOv8s, YOLOv8m, and YOLOv5s) on weapon detection performance under real-world surveillance conditions?

2. How robust are these models to occlusion, low illumination, and false positives in weapon detection?

3. What are the trade-offs in latency, FPS, and power consumption when deploying YOLO models on edge devices and server-class environments for weapon detection?
1.3 Objectives
To answer the research questions about the best YOLO-based architectures for real-time weapon detection in surveillance systems, the following objectives are established in this paper:
1.3.1 Dataset Standardization
The data will be combined to create a standardized dataset of guns and knives with uniform preprocessing, augmentation, and bounding-box annotations, guaranteeing consistency and impartiality in model testing. Such standardization is essential because it removes dataset biases that can influence how well models generalize to novel, unexplored environments. An important problem in weapon detection is the variation in real-life conditions, including occlusions, lighting, and background clutter. The processed data will use random cropping, flipping, and rotation to replicate the angles and levels of occlusion under which weapons appear [1, 5]. With a common and stable dataset, we can guarantee that all models are tested on the same scenarios and make a fair comparison between the different YOLO architectures. Moreover, the dataset will be structured to support real-life requirements, including sufficient representation of small objects such as knives or handguns, which are challenging to detect because of their size. This method helps address the complexity of real-world surveillance systems, where such environmental aspects can affect detection accuracy [6].
1.3.2 Comparative Benchmarking
To achieve this, the paper will critically compare four YOLO architectures, namely YOLOv8n, YOLOv8s, YOLOv8m, and YOLOv5s, using a set of standard metrics that enable a comparison of their weapon detection accuracy. The main measures are mAP@0.5, mAP@0.5:0.95, precision, recall, and F1-score. Along with these, precision–recall curves and confusion matrices will be created to give better insight into the performance of each model. This benchmarking will also examine specific failure modes such as false positives and false negatives, which matter in surveillance applications where false detections may trigger unnecessary security alarms [4]. In addition, a robustness study will be incorporated to determine how well each model accommodates changes in environmental conditions such as poor illumination, occlusion, and high scene clutter. The YOLO models will also be tested on dynamic and challenging video streams to gauge their resilience, as this reflects the real-world problems of surveillance systems in locations such as airports or shopping malls [2]. A particular feature of this benchmarking is the use of a cross-validation strategy in which the models are trained and tested on different subsets of the data. This makes the assessment more sensitive to possible model overfitting, so that the findings represent a more generalized evaluation of each architecture's performance [7, 8].
1.3.3 Efficiency Evaluation
This objective concerns the practical viability of every YOLO architecture: its latency (FPS), computational load, and power consumption. Each model will be judged by its effectiveness in edge environments, where computing resources are frequently limited. Models such as YOLOv8n will be tested for their capability to sustain high detection rates under the low-power requirements characteristic of embedded devices such as CCTV cameras or drones. This objective will also test the influence of latency and FPS on a range of real-time demands in mission-critical systems such as public safety [7, 9]. The trade-offs between the requirement for speed and the requirement for high accuracy will be discussed; this is especially important in situations where both rapid reaction and high accuracy are necessary. As an example, the capability to rapidly detect a weapon in a high-security environment like an airport may greatly decrease the response time and avert possible harm [8, 10]. The analysis will also examine the effect of model complexity on computational efficiency. Lightweight models such as YOLOv8n will be compared to heavier models such as YOLOv8m to find out how the various architectures behave under real-world deployment constraints such as limited processing power, available memory, and power consumption [9].
1.3.4 Error Analysis
Weapon detection systems should be highly dependable, with few errors such as false positives and false negatives. This objective will carry out a comprehensive error analysis to assess how each YOLO model copes with demanding real-world scenarios. The models will be evaluated for weapon detection under heavy occlusion, different lighting conditions, and cluttered backgrounds [7]. All models will be enhanced by augmentation methods such as hard-negative mining; such techniques aim at improving the model's ability to detect challenging or unclear instances, such as small knives or partially blocked guns. Also, temporal smoothing will be included to stabilize predictions across video frames, which will decrease the chance of false alarms in dynamic environments [9, 8] (a minimal sketch is given below). Much attention will also be paid to understanding and mitigating edge cases, uncommon but important situations in which a model may malfunction (very low-light scenes or complicated reflections). By analyzing failure cases, we seek to offer practical suggestions for enhancing the YOLO models and establishing best practices for implementing these models in real-life systems [3, 11].
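As a minimal illustration of the temporal smoothing mentioned above, the following Python sketch implements a k-of-n voting window over per-frame detections; the values k = 3 and n = 5 are illustrative assumptions rather than settings prescribed by this study:

    from collections import deque

    class KofNSmoother:
        """Raise an alarm only if a weapon is seen in >= k of the last n frames."""

        def __init__(self, k=3, n=5):
            self.k = k
            self.window = deque(maxlen=n)

        def update(self, weapon_detected):
            # Feed one frame's raw detection flag; return the smoothed alarm.
            self.window.append(bool(weapon_detected))
            return sum(self.window) >= self.k

    smoother = KofNSmoother(k=3, n=5)
    flags = [True, False, True, True, False, False, False]
    print([smoother.update(f) for f in flags])
    # -> [False, False, False, True, True, False, False]

Requiring agreement across several frames suppresses one-frame spurious detections at the cost of a short alarm delay.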

1.4 Research Goal
The main aim of the study is to establish a common benchmarking infrastructure for the YOLO-based architectures (YOLOv8n, YOLOv8s, YOLOv8m, and YOLOv5s) that can be adapted to detect weapons as part of a real-time surveillance system, with the primary emphasis on balancing detection accuracy and computational efficiency under different real-world deployment conditions. The framework is intended to serve not only academic purposes but also the needs of society, especially by improving public safety through faster and more reliable real-time detection of weapon threats. The study is unique in that, unlike previous research that has tested YOLO models on general object detection tasks, it targets the detection of weapons within surveillance systems. It includes practical implementation constraints, such as FPS (frames per second) and edge deployment capability, which many other studies do not consider. This paper offers a nuanced and practical recommendation system for security professionals, guiding them through multiple YOLO models on weapon-specific datasets so that they make the right choice for their operational requirements. Specifically, the study will:

• Develop a benchmarking framework in which various YOLO architectures are compared systematically for weapon detection. The assessment will involve measuring not only detection accuracy but also computational performance under real-world constraints such as FPS, latency, and power consumption [1, 5].

• Recommend YOLO models for different security scenarios: YOLOv8m will be suggested where accuracy is critical, such as a law enforcement server or forensic system, and greater detection accuracy is needed at the cost of some latency. YOLOv8n will be most suitable for edge deployments such as CCTV cameras, drones, and IoT devices, where low latency and efficiency are in high demand for real-time decisions in resource-constrained visual computing. YOLOv8s and YOLOv5s will be proposed as moderate compromises, appropriate to mid-level deployments that demand a balance between speed and accuracy [6, 4].

• Connect academic developments to the social context, placing the emphasis on the real-life application of the resulting solution, in which rapid, dependable, and efficient weapon detection can considerably decrease human losses and increase the level of civilian safety. The research will offer practitioners a clear deployment guide for selecting the appropriate architecture depending on the needs of real-time systems and operational constraints [2].

• Deliver real-time weapon detection capability to security staff. The study aims to reduce the number of human casualties by directly connecting the performance of YOLO models to real-world application in high-stakes settings such as airports, transportation hubs, and shopping malls, allowing faster response times and more accurate threat detection [7].

1.5 Paper Organization
Chapter II reviews state-of-the-art weapon detection methods, distinguishing between one-stage detectors (YOLO [1] and SSD [2]) and two-stage detectors (such as Faster R-CNN [3]). It emphasizes the importance of datasets, such as CCTV-style images, that pose realistic challenges like low light, occlusion, and motion blur, all of which can significantly influence weapon detection performance [4]. We also examine failure cases and subdivide them by finer-grained root causes, such as look-alike objects incorrectly classified as weapons and partial occlusions leading to detection errors, and we draw on techniques such as confusing-object mining and temporal reasoning modules that stabilize predictions across video frames to address such problems [5, 6].

Chapter III specifies the dataset, model architectures, and training procedures. It describes stratified splits for handling class imbalance, together with standardized preprocessing and augmentations that make the results reproducible. For fair comparison, all models are trained and evaluated under identical settings [10]. This chapter also contains replicable training recipes, hyperparameter tuning and calibration strategies, and notes on how augmentation strength and model depth affect performance.

Chapter IV reports the experimental results, focusing on precision, recall, F1-score, and mAP, together with per-class precision–recall curves and confusion matrices. We analyze these results for both desktop-class GPUs and edge devices, and study how quantization and distillation affect latency and accuracy. We also study errors and suggest remedies, such as threshold calibration and temporal smoothing, to enhance model robustness [9, 12].
Chapter V translates what has been learned through the experiments into practitioner guidance. This includes per-class threshold tuning, the application of k-of-n temporal smoothing windows, and the addition of second-stage verification modules to mitigate nuisance alarms. The chapter also elaborates the trade-offs of quantization and distillation against properly tuned accuracy for edge hardware deployment, and offers guidance on robustness to domain shift when addressing low light, glare, and motion blur. It further describes procedures for model monitoring, drift detection, and human-in-the-loop review to maintain the continued trustworthiness of weapon detection systems, and discusses privacy-preserving techniques and the compliance checklists (e.g., GDPR conformity) needed to ensure that AI is used responsibly [13].
Finally, Chapter VI concludes and summarizes the main contributions of this work, and points out its limitations, e.g., gaps in dataset coverage, a bias toward still images, and the absence of hardware-variance studies, which potentially affect model generalization. The chapter also describes next steps, such as scaling the dataset with video-temporal annotations, investigating clip-level inference with recurrent or transformer-based encoders, and exploring re-weighting schemes to generate more diverse hard-negative sets. It also suggests using interpretability methods like Grad-CAM/Score-CAM and multimodal sensor fusion incorporating audio and IoT data for improved real-time deployment robustness and system confidence [14, 11].

CHAPTER II
Literature Review
2.1 Weapon Detection with Deep Learning
Detection of weapons for public safety is an increasingly important subject as a result of the rising need for automatic surveillance. Such solutions create a safer environment through real-time detection of weapons like guns and knives. Traditional manual surveillance techniques are prone to a myriad of issues, including human fatigue, error, and latency. This has created fresh interest in deep learning models, and CNNs in particular, due to their superior ability to learn features from raw image data when large amounts of it are available, as in object detection. Weapon detection is no exception: the deep learning revolution is making a big splash through models like You Only Look Once (YOLO), a milestone in the history of real-time weapon detection systems. YOLO-based models are now the first choice in object detection, including weapon detection. They have the benefit of extracting features and performing classification in one forward pass, which is required for real-time operation (a typical surveillance system requirement). In the following, we discuss the contributions of different versions of YOLO and how they have been applied to weapon detection.
2.2 YOLO for Weapon Detection
The YOLO (You Only Look Once) model revolutionized real-time object detection by combining high speed with high precision. The first version set up the base with a unified detection system. It showed potential but had difficulty finding small objects such as handguns or knives, which are common in weapon detection. YOLOv2 then adopted a stronger backbone, Darknet-19, and reported better performance with a mAP of around 72% [1]. This was a major improvement in YOLO's capability to detect small objects, a crucial component of weapons detection.

YOLOv3 built upon these merits by adding multi-scale detection. This improvement meant that YOLOv3 could recognize objects of different sizes more accurately and was thus better suited to dynamic scenes. It achieved impressive performance on the typical object detection benchmarks and has proved a strong candidate for weapon detection, which requires detecting objects of various sizes, from small knives to long guns [1]. YOLOv3's excellent performance on weapons in complicated real-life conditions with occlusion, clutter, and low illumination has popularized it in CCTV-based systems.

Another significant milestone, YOLOv4, introduced several main advances such as mosaic augmentation and self-adversarial training. These optimisations made YOLOv4 more robust to typical problems in surveillance video, like occlusions and bad lighting. Reported evidence, such as that of Bhatti et al. [1], has demonstrated that YOLOv4 can obtain mAP scores as high as 91% for real-time weapon detection in CCTV footage, with the potential to eliminate false-positive alarms. YOLOv4's ability to produce excellent results while detecting weapons among distractors clearly visible in the scene (phones, tools, and other objects that people carry) further validates its suitability [1, 2].
2.3 YOLOv4 and Weapon Detection
The performance of YOLOv4 in weapon detection has been observed and reported in the literature. As mentioned by Bhatti et al. [1], YOLOv4 proved to be excellent when employed for gun and knife detection in real-time surveillance, and it controls false positives particularly well. In cluttered and low-light scenes, a small object like a knife was hard to detect because of its size, so mosaic augmentation was applied in YOLOv4 to make detection more robust. Thus the model could identify weapons even in crowded scenes (public locations, transport sites, airports) where fast decision-making is pivotal to public security [2, 4].

The application of YOLOv4 has also been extended to embedded hardware in edge settings. An important point is that on such devices everything must run with constrained computational power; edge servers, on the other hand, are not always available and can be slow to activate. YOLOv4 has been practically deployed on edge devices and yielded strong results in detecting weapons instantaneously without relying on high-powered servers. These features make YOLOv4 an excellent candidate for large-scale deployment in low-resource environments [6, 10].
2.4 Faster R-CNN vs YOLO
Although YOLO has demonstrated strength in speed and real-time performance, other models like Faster R-CNN have surpassed it in detection accuracy. Faster R-CNN utilizes a Region Proposal Network (RPN), which lets it detect objects more reliably in very crowded surroundings. For firearm and knife detection, the model has been demonstrated to reach mAP values around 93%, surpassing YOLO in terms of accuracy [9, 12]. But Faster R-CNN is computationally too expensive for real-time use. Its slow inference and large memory footprint also make it infeasible for embedded applications and for real-time surveillance systems that urgently require low latency and fast processing.

YOLO mitigates this potential disadvantage through an end-to-end trainable architecture that delivers speed and accuracy at the same time. Therefore, YOLO remains the best choice for online weapon detection applications requiring real-time decision making [2, 13].
2.5 SSD and MobileNet for Embedded Devices
In light of growing industry interest in edge deployments, lightweight model designs such as the Single Shot MultiBox Detector (SSD) and MobileNet for embedded devices have received increasing attention. These models are suitable for applications with constrained computational resources, like IoT devices or embedded cameras. SSD is considerably faster than Faster R-CNN at the cost of some precision. It has been used to great effect for weapon detection in settings where response time is more important than perfect precision [14, 11].

MobileNet combined with YOLO has shown promising performance for edge deployments, such as weapon detection in resource-constrained environments. When running on edge devices such as the NVIDIA Jetson Nano, MobileNet in conjunction with YOLO makes it possible to achieve fast processing while maintaining decent detection accuracy. Its lightweight setup strikes a trade-off between real-time capability and accuracy, ensuring adequate performance for on-site surveillance services where resources may be limited. But this speed comes at the cost of a marginal decrease in overall detection performance compared to heavier models such as YOLOv4 [15, 8].
2.6 YOLOv4 vs YOLOv5 and YOLOv8
Although some still prefer YOLOv4 for real-time weapon detection, the advent of YOLOv5 and YOLOv8 has raised the bar. YOLOv5, an unofficial release by the Ultralytics group, proved faster and more accurate than YOLOv4 in real time on multiple tasks, including weapon detection [16, 15]. YOLOv5 has become a default choice for server and edge deployment thanks to its speed–accuracy optimizations.

The newest version, YOLOv8, is designed primarily for speed and effectiveness. YOLOv8n, YOLOv8s, YOLOv8m, and YOLOv5s cover different hardware classes and so can be used in many deployment scenarios. YOLOv8n is designed for ultra-light edge devices, while YOLOv8m balances speed and accuracy and is thus suitable for real-time surveillance systems [17, 18]. Recently, YOLOv8 has become one of the best weapon detection methods for dynamic environments including airports, shopping malls, and transportation centers, where real-time processing is mandatory to guarantee public safety [19, 20, 21].

2.7 Summary
The real-time detection of weapons in surveillance systems is becoming more and more relevant for addressing security issues in public settings (airports, malls, or transportation stations). Recent years have seen breakthroughs in the field due to deep learning techniques, more specifically Convolutional Neural Networks (CNNs). Among these models, YOLO (You Only Look Once) is considered a state-of-the-art method for real-time object detection with compelling performance in terms of speed and accuracy. In this section we summarize our literature survey of state-of-the-art YOLO applications for weapon detection and describe the relevant findings that motivated our investigation of various YOLO networks under a common approach.
2.7.1 YOLO Architectures for Weapon Detection
The YOLO architectures have demonstrated notable performance in weapon detection, especially YOLOv3 and YOLOv4. YOLOv3 handles multi-scale detection, which produced a big jump in object recognition performance, especially for small arms such as guns and knives. This variant proved capable of dealing with occlusion and clutter under realistic surveillance conditions. YOLOv3 was further improved in subsequent research through techniques like mosaic augmentation [1] and self-adversarial training [2], which strengthen the model's robustness under adverse circumstances such as dark environments and overlapping objects.

Studies such as [1] demonstrated promise in real-time weapon detection from CCTV footage: YOLOv4 obtained mAP values of approximately 91%, making it a robust model for firearm and knife detection in crowded or low-light scenarios. These contributions were important for reducing false positives and improving detection accuracy in the interest of security in surveillance systems. YOLOv4's achievement in weapon detection can be attributed to its combination of high detection accuracy and real-time speed, which was difficult for preceding methods like Faster R-CNN.

2.7.2 Faster R-CNN and SSD Models
Although the YOLO models boast high-speed performance, other network architectures such as Faster R-CNN and SSD (Single Shot MultiBox Detector) have shown better detection accuracy. Faster R-CNN, equipped with an RPN, achieves mAP around 93% on firearm detection [2]. This better performance, however, comes at the expense of high inference time, which precludes its use in real-time surveillance systems, in particular on edge devices. Moreover, the computational overhead of Faster R-CNN is much higher, which limits its applicability in resource-limited scenarios.

On the other hand, SSD is faster than Faster R-CNN but has worse detection accuracy, especially for small (including occluded) weapons. SSD is better suited for online applications where speed matters more than the highest accuracy. However, this trade-off between precision and recall renders SSD less effective for scenarios in which weapon detections must be both precise and reliable, including public safety applications [2, 3].
2.7.3 Accuracy vs Efficiency
YOLOv5, despite being released unofficially by the Ultralytics team, has proven faster and more accurate than YOLOv4. Its real-time performance exceeds YOLOv4's, with good results on a wide range of object detection tasks, including weapon detection [1, 2]. YOLOv5 scales to both server and edge deployments, making it flexible for real-time applications that need a trade-off between accuracy and computational cost.

The newest version, YOLOv8, goes one step further. YOLOv8 offers scaled variants (YOLOv8n, YOLOv8s, YOLOv8m, among others), each specialized for different hardware. YOLOv8n is intended for deployment on lightweight, low-power edge platforms. YOLOv8m, by contrast, trades speed against accuracy so that it can be adopted in real-time surveillance systems with high-performance detection close to that of the larger variants but without their latency [1]. The ability to tune performance for both accuracy and running time is a huge evolutionary step over standard YOLO, giving YOLOv8 broad deployability, from high-end servers down to power-constrained edge hardware (like the Jetson Nano) [2].

The main benefit of YOLOv8 is its balance, offering the right trade-offs for all deployment targets. Because YOLOv8n is lightweight, it can be implemented on embedded hardware platforms with low computational resources, whereas YOLOv8m is more suitable for stronger server-side solutions that support better accuracy. YOLOv8's versatility makes it an ideal choice for weapon detection in crowded public places such as train stations or malls, where both speed and accuracy are needed.
2.7.4 Edge Deployment and Lightweight Models
There is demand for edge deployments in security systems, and there has been an effort toward lightweight object detection models that can run on embedded platforms such as the Jetson Nano to perform real-time weapon detection. Models such as MobileNet with YOLO yield good weapon detection results in such restricted environments. MobileNet is known for its speed and efficiency and can run on low-powered devices. The detection performance of MobileNet+YOLO is slightly lower than that of high-capacity YOLO models such as YOLOv4, but the speed advantage of this type of network is valuable for edge applications, where it can be recommended [2].

Such computationally efficient versions are especially handy in settings demanding real-time processing under a limited amount of computational resources. Embedded devices (surveillance cameras, drones, and the like), which have very limited hardware resources, can take advantage of recent research on MobileNet and YOLO, as these offer a trade-off between accuracy and resource utilization. This trade-off allows weapon detection systems to be installed in different conditions without the need for costly hardware [1, 4].
2.7.5 Unified Benchmarking and Implementation
One of the most important drawbacks of previous work is that it lacks a direct comparison between different YOLO-type networks within the same framework. Previous works have generally been confined to single models or datasets, which makes it difficult to generalize the trade-offs among accuracy, latency, and deployment availability. This work addresses that gap through an in-depth comparative study of YOLOv8n, YOLOv8s, YOLOv8m, and YOLOv5s for real-time, edge-based weapon detection in surveillance. In summary, this work generates benchmark results and a unified evaluation framework for weapon detection, including trade-off analysis among the various YOLO versions and real-world deployment cases, giving a clear understanding of when to use which model for an efficient and reliable weapon detection system [5].

CHAPTER III
Methodology
3.1 Dataset and Splits
I use the Guns–Knives dataset with train (≈4,400), validation (≈1,040), and test (≈380) images. A cleaned Guns–Knives dataset was utilized, prepared from labeled images of two classes (gun and knife). The dataset includes a wide range of real-world conditions such as different backgrounds, illumination, and object sizes. The images were collected from various public sources and annotated specifically for balanced treatment of weapon instances in both indoor and outdoor scenes.

The data were split into training, validation, and test sets containing roughly 4,400, 1,040, and 380 images, respectively. These splits are meant to encourage models learned on this dataset to generalize to unseen data, which is an important aspect given their intended use in real-world video surveillance systems. The two classes (guns and knives) are fairly balanced across these sets to avoid biasing the models. Class distribution was also validated for consistency among splits (see Figure 3.1), where a comparable pattern can be observed in the training, validation, and test sets.

The diversity of the dataset, as well as its balanced class distribution, makes models trained on it more capable in real-world cases. In addition, the dataset is created specifically to emulate difficulties experienced in surveillance systems (viewpoint changes, illumination variation, and cluttered backgrounds). This makes the dataset suitable for training models that must operate in the varied environments encountered in realistic hidden or cluttered scenes.

Also, the balanced split of our dataset ensures that models do not overfit to a single class and detect guns and knives with similar performance. In the context of weapon detection, such a balanced distribution is particularly important, since imbalanced sets may lead to biased predictions that perform poorly on under-represented classes. The importance of a balanced dataset such as this one is supported by previous works, whose efforts to avoid imbalanced data addressed losses of accuracy and increased false positives in detection systems [1, 2].
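As a sketch, the per-split class balance reported in Figures 3.1 to 3.3 can be checked with a few lines of Python; the directory layout and class-id mapping below are assumptions about a standard YOLO-format dataset, not the exact structure of this project:

    # Count per-class instances in each split of a YOLO-format dataset.
    from collections import Counter
    from pathlib import Path

    CLASS_NAMES = {0: "gun", 1: "knife"}   # assumed class-id mapping

    def count_classes(split_dir):
        counts = Counter()
        for label_file in Path(split_dir, "labels").glob("*.txt"):
            for line in label_file.read_text().splitlines():
                if line.strip():           # each line: "<class_id> x y w h"
                    counts[CLASS_NAMES[int(line.split()[0])]] += 1
        return counts

    for split in ("train", "valid", "test"):
        print(split, dict(count_classes(f"data/{split}")))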
Figure 3.1: Class distribution (training). Validation and test follow similar trends.
3.2 Preprocessing and Augmentation
Several preprocessing and augmentation techniques were used to help the models generalize across different conditions. Preprocessing involves resizing images to 640×640 and normalizing pixel values, since 640×640 is the base input resolution for YOLO models. This rescaling not only makes the input images compatible with the models but also lets them learn features effectively from the data. In addition, resizing to this resolution keeps computation manageable while preserving an adequate level of detail for good weapon detection [1, 2].
The augmentation methods employed are intended to mimic the wide range of real lighting and viewing scenarios a weapon detection system can encounter. These techniques include the following (a configuration sketch is given after Section 3.2.3):

Figure 3.2: Class distribution (validation split).
3.2.1 Horizontal Flips
We flip the images horizontally during training. This helps the model simulate multiple viewing perspectives and makes it more robust to weapon orientation. In practical surveillance systems, the orientation of a weapon may vary with the stance of the bearer. By feeding in flipped images, the model learns to recognize weapons from various orientations, making it more general [3].
3.2.2 HSV Jitter
HSV jitter randomly perturbs the hue, saturation, and value (brightness) of images. This emulates various lighting environments and makes the model more robust across scenarios. Real-world settings vary considerably, from bright sunlight to dark spaces; HSV jitter helps the model learn to deal with these lighting variations and ensures the system operates robustly under different visual situations [4].

Figure 3.3: Class distribution (test split).
3.2.3 Mosaic
The mosaic augmentation combines multiple images into a single training image. This approach helps the model learn better feature representations, as it presents different object compositions in a single image. It is especially useful for managing occlusions or partial weapon views, since it resembles real-world scenarios where objects are partially covered or hidden by other objects. Mosaic augmentation appears to enhance the model's performance on small or partly occluded objects [6].

Such augmentations are important, as they make the model more generalizable to a broader set of realistic conditions. They further enhance the robustness of the model by adding variations such as lighting, rotation, and object occlusion that occur in real surveillance settings. Furthermore, these methods help to overcome overfitting by offering the model a larger and more varied pool of training samples, which allows it to perform better in weapon detection under different circumstances. As shown by prior work, such augmentations are pivotal for improving the performance of deep learning models on object detection problems, especially in the difficult scenarios characteristic of surveillance and security applications [2, 10].
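For concreteness, the three augmentations above can be expressed as Ultralytics training overrides. The probabilities and gains shown are illustrative defaults, not necessarily the exact values used in this study:

    # The augmentations of Section 3.2 as Ultralytics train() overrides.
    # Values are illustrative assumptions.
    aug_overrides = dict(
        fliplr=0.5,                         # horizontal flip probability (3.2.1)
        hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,  # HSV jitter gains (3.2.2)
        mosaic=1.0,                         # mosaic applied throughout training (3.2.3)
    )
    # Passed later as: model.train(data="guns_knives.yaml", imgsz=640, **aug_overrides)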

Figure 3.4: Split sizes (number of images).
3.3 Models
In this work we compare four models: YOLOv8n, YOLOv8s, and YOLOv8m from the Ultralytics YOLOv8 family, plus YOLOv5s as a reference model. These four are selected because they cover network architectures from small to large with varying computational complexities, and hence can be compared on both accuracy and efficiency. They vary mainly in network depth, parameter count, and suitability for different hardware. All models were trained with the same settings, including hyperparameters.

YOLOv8n (nano): This is the smallest and fastest version of YOLOv8, targeted at edge, IoT, or embedded devices with low computational budgets. YOLOv8n has a smaller parameter count, so it makes predictions faster while being less accurate on average than the larger variants. It is designed for efficiency and is substantially faster than heavier YOLO configurations. YOLOv8n suits applications such as a CCTV system or an embedded device, given its very low latency and power consumption [1, 2].

YOLOv8s (small): YOLOv8s is the bigger sibling of YOLOv8n that compromises between speed and accuracy, enabling inference on more powerful edge devices. YOLOv8s has higher detection precision than YOLOv8n at a moderate computational cost. This model is a good fit for mid-range devices with somewhat more compute and a limited performance–speed trade-off. YOLOv8s is commonly deployed in latency-sensitive applications including retail security and public safety.

YOLOv8m (medium): YOLOv8m is designed for mid-range to server-class devices; it has better accuracy but requires more computation. It is a compromise model that provides a strong accuracy–latency trade-off. YOLOv8m suits use cases where more accuracy is required and the resources can afford a more complex model. It is generally used in applications such as high-end surveillance camera systems or robotic systems, where both real-time and accurate performance are important [5, 6].

YOLOv5s: As a lightweight and efficient member of the YOLOv5 family, YOLOv5s is included in this comparison for completeness. YOLOv5s performs consistently well in object detection and is widely recognized for its performance on edge devices in real-time applications. Though from an older generation of YOLO, YOLOv5s remains a useful comparison model with a moderate trade-off between accuracy and inference speed. It has been used in industrial applications, including automated quality control and security monitoring systems.

All of these models used the same dataset and setup to ensure a fair comparison. Hyperparameters (number of epochs, batch size, and learning rate) were kept identical across models rather than tuned per model. The models were developed on top of the Ultralytics YOLO framework, which provides optimized training and inference. The framework is very efficient, with fast training times and applicability to real-time systems [12]. Performance was measured using standard metrics (mean average precision (mAP), precision, recall) and inference time, to compare accuracy and efficiency under the same environment.
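A minimal sketch of this shared evaluation setup with the Ultralytics framework is shown below; the checkpoint and dataset paths are hypothetical placeholders:

    # Evaluate the four compared checkpoints under identical settings.
    from ultralytics import YOLO

    for name in ("yolov8n", "yolov8s", "yolov8m", "yolov5s"):
        model = YOLO(f"runs/{name}/weights/best.pt")   # assumed run layout
        metrics = model.val(data="guns_knives.yaml", imgsz=640, split="test")
        print(name,
              f"mAP50={metrics.box.map50:.3f}",
              f"mAP50-95={metrics.box.map:.3f}")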
3.4 Training Configuration
The following hyperparameters were used for training the models:

• Epochs: 20, a common budget for fine-tuning object detection problems. Previous research has demonstrated that 20 epochs are usually sufficient for models to converge without overfitting to small datasets [1, 2].

• Batch size: 16, i.e., training a reasonable number of images at once. Batch size is a critical parameter that influences a model's convergence rate and its memory usage during training. Similar object detection tasks are commonly run with a batch of 16 to maintain a reasonable trade-off between speed and performance [3, 4].

• Image size: 640×640, the standard input resolution for YOLO models. This scale is widely used because it balances computation speed against detection accuracy: downsampling may accelerate training at the price of losing crucial features, while upsampling results in longer training time [5].

• Optimizer: Adam, considered one of the best optimizers for training deep learning models. Adam combines the best of AdaGrad and RMSProp, making it a strong option for training deep networks. It adapts the learning rate using estimates of the first and second gradient moments, which makes it converge faster and perform well compared to traditional gradient descent algorithms [6].

• Learning rate schedule: a scheduler from the training library was employed for learning rate annealing. The scheduler adapts the learning rate as training progresses so that the model converges well. This technique is in widespread practical use in deep learning to stabilize training and improve the accuracy of the trained model [10].

• Early stopping: patience of 3, i.e., training stops when validation performance does not increase over three successive epochs. Early stopping avoids overfitting by halting training once the model no longer improves on the validation set. This is particularly helpful in object detection, because training deep networks on small datasets can result in overfitting [9, 12].

Training was done on a GPU, as training YOLO models is resource-hungry; GPU acceleration makes training much faster, especially for large datasets and architectures like YOLO [10]. Random seed values were fixed to make the results reproducible: with the seed fixed, the training process becomes deterministic, so experiments yield consistent results across runs, which is essential for accurately evaluating model performance [11]. A sketch of this configuration as a single training call is given below.
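The sketch below assembles the hyperparameters above into one Ultralytics training call. The dataset YAML name is a hypothetical placeholder, and cos_lr stands in for the unspecified annealing schedule:

    # One training run per model; repeated for yolov8n/s/m and yolov5s.
    from ultralytics import YOLO

    model = YOLO("yolov8s.pt")
    model.train(
        data="guns_knives.yaml",      # hypothetical dataset config
        epochs=20,                    # Section 3.4: 20 epochs
        batch=16,                     # batch size 16
        imgsz=640,                    # 640x640 inputs
        optimizer="Adam",             # Adam optimizer
        cos_lr=True,                  # learning-rate annealing
        patience=3,                   # early stopping after 3 stale epochs
        seed=0, deterministic=True,   # fixed seed for reproducibility
        device=0,                     # single GPU
    )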

3.5 Evaluation Metrics
The performance of the models was evaluated using the following metrics:

• mAP@0.5: mean average precision at an IoU threshold of 0.5. This is the conventional evaluation for object detection and measures how well predicted boxes overlap with the ground truth. It is a popular metric for judging the quality of a model's predicted object locations in images: the higher the mAP@0.5, the better the model localizes objects with correct boxes [1, 2].

• mAP@0.5:0.95: the model's performance averaged across multiple IoU thresholds (sampled from 0.5 to 0.95). This measure is significant because it summarizes performance over a range of IoU thresholds, showing that the model's accuracy is not only optimal at one particular threshold but holds under different scenarios. mAP@0.5:0.95 is considered a stricter way to measure model performance, specifically in object detection, where the IoU between predicted and ground-truth boxes fluctuates [3, 6].

• Precision: the ratio of true positives to all positive predictions the model made. This is an important metric for tasks where false positives are expensive. In weapon detection, high precision implies that when the model predicts the presence of a weapon, the prediction is likely to be true [5].

• Recall: the number of true positives divided by the number of all positive instances in the dataset. For weapon detection, high recall implies that most weapons in an image are detected, so there is minimal chance of missing a weapon. Nonetheless, higher recall is normally achieved at the cost of precision, and the trade-off between these two metrics becomes crucial in real-time applications [6, 10].

The PR curves of the models were also plotted to illustrate the trade-offs between precision and recall at various decision thresholds. A PR curve makes it easier to find the optimal trade-off between precision and recall, especially on imbalanced datasets such as ours (the numbers of guns and knives may not be equal). By drawing these curves, one can assess the performance of the model at different thresholds and choose the threshold that best fits the operational requirement [9, 12].

Misclassifications and the model's ability to separate guns from knives were further analyzed using confusion matrices. A confusion matrix provides a more in-depth breakdown of the model's predictions, showing how many objects were correctly or incorrectly classified. This information is vital for understanding where a model fails (e.g., failure cases in distinguishing weapons from tools) [13]. A small sketch of these metric computations is given below.
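The following minimal Python sketch restates the definitions above (precision, recall, F1, and box IoU). It illustrates the notation only and is not the framework's evaluation code:

    def precision_recall_f1(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0    # P = TP / (TP + FP)
        r = tp / (tp + fn) if tp + fn else 0.0    # R = TP / (TP + FN)
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    def box_iou(a, b):
        """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    print(precision_recall_f1(tp=90, fp=10, fn=15))   # -> (0.9, 0.857..., 0.878...)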

CHAPTER IV
Results and Discussion
4.1 Results
I evaluate YOLOs with regard to performance on the Guns-Knives dataset in
terms of weapon detection. We have experimented with 4 YOLO models: (YOLOv8n,
YOLOv8s, YOLOv8m and YOLOv5s) and have significant scores in the form of mAP
at 0 5, mAP at 0. 5:0.95, precision and recall. The test findings are involved in the
tests of accuracy and efficiency regarding any model in cases of weapon detection.
As it can be seen, YOLOv8m is the best when mAP 0. 5 (0.928) and mAP 0.
5:0.95 (0.671); that is, it could be used to successfully identify the gun as well as the
knife, reducing false positives and false negatives. This is why YOLOv8m is the most
suitable model in the high-performance setting with accuracy and speed of detection
as critical factors [1, 22].
Conversely, YOLOv8n (the smallest model in size and computational cost) achieves an mAP@0.5 of 0.872, a decent performance for a lightweight model. Though YOLOv8n trades some accuracy for speed, it remains capable of high-quality weapon detection, making it best suited to edge deployment scenarios where low latency and computational efficiency are important [3, 4].
YOLOv8s and YOLOv5s offer the best trade-off between speed and accuracy. YOLOv8s performs slightly better than YOLOv5s, with higher mAP@0.5 and mAP@0.5:0.95 values. Both models suit mid-range scenarios with more computational resources than edge devices while still requiring real-time performance. They sit between the highly accurate YOLOv8m and the efficient YOLOv8n, making them well suited to a large spectrum of real applications where both quality and efficiency matter [5, 6].

4.1.1 Precision and Recall
According to the obtained results, YOLOv8m reports the best precision (0.901) and recall (0.889), meaning it can correctly detect the target weapons (guns and knives) with fewer false positive and false negative predictions than the other evaluated models. YOLOv8n is less precise but retains a strong recall (0.845), so it can still detect most weapons in real-time applications, particularly on edge devices. Meanwhile, YOLOv8s and YOLOv5s, which strike a compromise between precision and recall, also achieve good overall performance for real-time weapon detection.
These evaluation metrics, and more specifically the precision-recall trade-off, provide insight into the strengths and weaknesses of the different models. In settings that treat false positives as a major concern (e.g., security systems in open public environments), YOLOv8m, with its higher precision and recall, should be chosen. In more constrained scenarios where computational resources are limited, YOLOv8n may offer acceptable performance without sacrificing speed [12].
On the other hand, according to the quantitative results, YOLOv8m is the best model when both accuracy and detection quality matter, making it suitable for real-time, high-quality surveillance systems. High-speed deployment on the edge is most effectively achieved with YOLOv8n, while mid-level use can be handled by the speed-precision trade-off of YOLOv8s or YOLOv5s [14].
Table 4.1 presents the performance measures of the evaluated models:
Table 4.1: Performance of the YOLO models on the Guns-Knives dataset.

Model      mAP@0.5   mAP@0.5:0.95   Precision   Recall
YOLOv8n    0.872     0.593          0.860       0.845
YOLOv8s    0.901     0.642          0.884       0.867
YOLOv8m    0.928     0.671          0.901       0.889
YOLOv5s    0.895     0.627          0.872       0.860
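For reference, metrics of this kind can be reproduced with the Ultralytics validation API; in the sketch below the checkpoint path and dataset YAML are hypothetical placeholders:

```python
from ultralytics import YOLO

# Validate a trained checkpoint on the Guns-Knives validation split
model = YOLO("runs/detect/train/weights/best.pt")  # hypothetical checkpoint path
metrics = model.val(data="guns_knives.yaml")       # hypothetical dataset YAML

print(metrics.box.map50)  # mAP@0.5
print(metrics.box.map)    # mAP@0.5:0.95
print(metrics.box.mp)     # mean precision
print(metrics.box.mr)     # mean recall
```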

4.2 Confusion Cases and Error Analysis
Although the YOLO models performed satisfactorily in general, some misclassification cases remain. These errors become more pronounced for small or occluded weapons, which are often mistaken for other items. The most frequent misclassifications are:
4.2.1 Small Knives
Detecting small knives is difficult, particularly when they are partly occluded or viewed at an angle. Such items are easily confused with metallic tools, slim cylindrical objects, and other similarly shaped features. Mistaking small knives for tools is a classical object-recognition problem, which is especially hard when the object partially leaves the field of view or is occluded by other items in the scene. Models may benefit from hard-negative mining approaches [21], which emphasize the examples that are hardest to classify. Hard-negative mining identifies and concentrates on the samples the model fails to predict correctly, and is an available route to improving performance on difficult instances of this kind [3].
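A minimal hard-negative mining loop might look like the following sketch (all paths are hypothetical): the trained model is run over frames known to contain no weapons, and any frame that still triggers a detection is copied into the training set as a label-free background image, the YOLO convention for negative examples:

```python
import shutil
from pathlib import Path
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")   # hypothetical checkpoint
negatives_dir = Path("dataset/hard_negatives")      # hypothetical output folder
negatives_dir.mkdir(parents=True, exist_ok=True)

# Frames known to contain no weapons: any detection here is a false positive
for frame in Path("frames/weapon_free").glob("*.jpg"):
    result = model.predict(str(frame), conf=0.5, verbose=False)[0]
    if len(result.boxes) > 0:
        # Keep the frame as a background (label-free) image for retraining
        shutil.copy(frame, negatives_dir / frame.name)
```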
4.2.2 Obstructed Guns
Although guns are identified with high confidence, objects with a gun-like appearance (e.g., long-handled tools) are occasionally misdetected. These errors are usually caused by confusable object shapes: a metal stick can be mistaken for a gun, especially when it protrudes within a cluttered visual environment. In practice, the model may find the outline of a gun-like object but wrongly classify it among other objects that share a similar shape [4]. To mitigate these misclassifications, better occlusion handling can be added during augmentation. Occlusion augmentation simulates partial obstruction of objects, which can teach the model to identify an object even when it is not entirely visible [5].
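One readily available approximation of occlusion augmentation is cutout-style random erasing, sketched below with torchvision (the sample image path is hypothetical); a random patch of the image tensor is blanked out so the model must learn from partial evidence:

```python
from torchvision import transforms
from PIL import Image

# Cutout-style occlusion: randomly erase a patch so the model learns
# to recognize weapons that are only partially visible
occlusion_aug = transforms.Compose([
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5, scale=(0.02, 0.2), ratio=(0.3, 3.3), value=0),
])

img = Image.open("sample.jpg")   # hypothetical training image
augmented = occlusion_aug(img)   # tensor with a random patch zeroed out
```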
4.2.3 False Positives and Negative Samples
False positives, in which non-weapon objects are misidentified as weapons, are common. They often occur when everyday objects such as bags and bottles contain components the model has learned to associate with weapon-like shapes. False positives in weapon detection systems can be hazardous, particularly in high-risk locations where false alarms are costly. To deal with these problems, more advanced methods such as multi-class classification and context awareness (modeling the surrounding environment) can be implemented to improve the model's ability to differentiate weapons from non-weapon objects [6, 10]. These misclassifications indicate a continuing need to improve detection models, especially under partial visibility, occlusion, and similarity among objects. Making the training set larger and more varied with challenging examples, through techniques such as hard-negative mining or more advanced augmentation, can further reduce error rates in real-world deployments [9].
4.3 Qualitative Results
To gain insight into the models' behavior, qualitative results were obtained by visualizing the bounding boxes detected by each YOLO model on test images. These visualizations demonstrate the models' ability to localize weapons under different conditions: partial occlusion, varying lighting, and varying object sizes. Qualitative visualization is a clear way to evaluate how well the proposed solution generalizes to real-world situations, as it makes it possible to observe the behavior of the model under different (and sometimes unexpected) conditions.
The visualizations confirm that YOLOv8m generates the most accurate bounding boxes, aligning tightly with guns and knives. In challenging situations, such as partly occluded weapons or weapons in crowded scenes (where smaller objects can be hard to detect), YOLOv8m shows excellent results. This robustness to hard cases gives it the most stable performance for real-world scenarios such as surveillance systems, where objects may be partially occluded or cluttered by other objects in the scene [1, 2].
The smallest and fastest model, YOLOv8n, sometimes misses small objects or produces false detections. Despite these mistakes, YOLOv8n works well enough to be useful given its compact size. As an efficient, real-time, edge-capable model, it is suitable for applications with limited computational resources, e.g., embedded devices or surveillance cameras with low processing power. Nevertheless, owing to its smaller network size, YOLOv8n does encounter difficulties in complex cases such as small-object detection or objects occluded by other items [4].
The capacity of the YOLO models to work in real-world conditions (blurred or low-light environments) makes a system based on them well suited to surveillance applications. The models show resistance to challenges such as lighting changes, motion blur, and occlusions that are prevalent in real video streams. For instance, YOLOv8m is resilient to difficult lighting in which other methods may fail due to low visibility of the weapon [5]. This ability to work in dynamic and possibly uncontrolled ambient conditions is important for practical monitoring of public areas, where lighting may vary considerably and people may be in rapid motion.
That said, there is definitely room for improvement. Performance in hard real-world scenarios could be raised further by augmenting the training data with more realistic object occlusions and lighting changes. Such augmentation techniques simulate the range of situations a model may encounter in the field, so the model can be trained to recognize weapons more reliably in a variety of circumstances [6, 10].
Furthermore, additional variability in occlusion (objects partially covered by others) and lighting changes (night-time or shadowy conditions) should be applied, since robustness can be improved further for real-life scenarios with environmental variation. Various environmental conditions, including weather and indoor/outdoor settings, can be included in the training set to promote the adaptability of the model [12].
Figure 4.1: End-to-end study pipeline.

CHAPTER V
Performance Analysis
5.1 Performance Evaluation
The four models are evaluated using key measures that indicate both their accuracy and their efficiency: mean Average Precision at 0.5 IoU (mAP@0.5), average precision across IoU thresholds (mAP@0.5:0.95), precision, and recall. These metrics are essential for appreciating how good each model is at detecting weapons and under what circumstances it works most effectively.
YOLOv8m achieves the best results, with an mAP@0.5 of 0.928 and an mAP@0.5:0.95 of 0.671, showing that it reaches the best level in weapon detection, including for small and occluded objects. It also presents the highest precision (0.901) and recall (0.889), indicating that it can detect firearms and knives effectively while avoiding false positives and false negatives. The higher precision shows that when YOLOv8m predicts a weapon to be present it is more likely to be right, and the high recall implies it can detect most of the weapons in our dataset even in challenging scenarios. YOLOv8m is the best choice for real-world applications that require detection accuracy and model reliability at the same time, such as high-performance surveillance systems [2].
On the other hand, the smallest model in file size and computational cost (YOLOv8n) has an mAP@0.5 of 0.872, lower than the bigger models but still a decent performance. It trades part of its accuracy for speed and efficiency, which makes it suitable for deployment on edge devices with scarce computing resources. Even for this model, the precision (0.860) and recall (0.845) are still quite good, meaning it can correctly detect weapons but may struggle with small or partially covered objects. With these trade-offs, the ultra-lightweight YOLOv8n is well adapted to applications that need minimal computing overhead and real-time processing, such as embedded or low-cost surveillance camera deployments [4].
The accuracy and efficiency of YOLOv8s are balanced, with an mAP@0.5 of 0.901 and an mAP@0.5:0.95 of 0.642; it is slightly more accurate than YOLOv5s at comparable efficiency. YOLOv8s fits mid-tier deployments: platforms richer than embedded devices but still short of server hardware, where real-time object detection remains a requirement. Its precision (0.884) and recall (0.867) imply that it is effective at identifying weapons and balances speed and detection quality well, so it applies to use cases that require a trade-off between cost, efficiency, and performance. The older YOLO generation also remains a strong competitor: YOLOv5s reaches an mAP@0.5 of 0.895 and an mAP@0.5:0.95 of 0.627, and has been shown to perform well in real-time or edge-computing applications where computational efficiency is important. Its precision (0.872) and recall (0.860) suggest it can reliably identify weapons with a low false-positive rate. YOLOv5s is not as accurate as YOLOv8m, but its speed-accuracy trade-off makes it useful in mid-range deployments where real-time operation is needed and computational resources are constrained [10, 9].
In summary, YOLOv8m leads in accuracy but is the most computationally demanding. YOLOv8n has lower accuracy but is the fastest and can be deployed in real time on power-constrained edge devices. YOLOv8s and YOLOv5s reach a good balance between speed and accuracy, making them suitable for mid-level deployments where solid performance is required and the available resources allow more complex models than YOLOv8n [13].
5.2 Comparative Analysis
In this section, we present a comparative analysis of the four YOLO models in terms of accuracy and latency. All models were assessed with the prominent performance measures mAP@0.5, mAP@0.5:0.95, precision, and recall. With respect to deployment in a diversity of situations, the models YOLOv8m, YOLOv8n, YOLOv8s, and YOLOv5s show distinctly different performance profiles.
YOLOv8m has the best overall performance, with an mAP@0.5 of 0.928 and an mAP@0.5:0.95 of 0.671. It has the greatest capability for capturing small or obscured objects and is the most precise in weapon detection. Its precision (0.901) and recall (0.889) mean it can identify weapons with minimal false positives and false negatives. YOLOv8m is the appropriate model for server-side deployments where accuracy is highly valued and ample computational means are available. Its high performance suits areas such as community surveillance systems, police surveillance, and other locations that require high-sensitivity detection [1, 2].
However, the higher computational complexity of the model may make it expensive to run on the edge in real time, where faster inference is needed. On the other hand, YOLOv8n has the lowest detection performance (mAP@0.5 = 0.872) but is the most efficient of the models. Its low computational cost allows it to run on devices with constrained hardware resources, making it an enticing solution for real-time surveillance, particularly on edge devices (e.g., CCTV cameras or drones) where resource constraints are paramount. In this way YOLOv8n trades accuracy for speed, so it can be used on devices with scarce computing power that still demand real-time performance [18, 20].
YOLOv8n is not as precise, but it is a very good choice when computational resources must be conserved without sacrificing performance too heavily. YOLOv8s, in turn, offers a trade-off between cost and efficiency with an mAP@0.5 of 0.901: better accuracy than YOLOv8n at a moderate computational cost, deployable on mid-range devices with restricted computing resources. It applies to settings that require real-time processing together with relatively high accuracy, such as retail security or public-space surveillance, where both speed and accuracy are imperative. Its precision of 0.884 and recall of 0.867 indicate that YOLOv8s can consistently recognize weapons in different situations without degrading [5, 6].
In the same manner, YOLOv5s, though from an older generation of the YOLO family, offers good performance with an mAP@0.5 of 0.895. It is another sound option for applications balancing speed and accuracy, and a reasonable recommendation when system resources fall just short of what YOLOv8m requires. YOLOv5s is close to YOLOv8s in accuracy and efficiency, which makes it a genuine competitor for deployment scenarios where speed is an issue. Its gap to YOLOv8s is small, so it remains suitable for real-time applications with average computational resources (e.g., industrial or mid-scale security) [9].
Each model type has advantages depending on the deployment situation. YOLOv8m is an optimal choice for high-performance systems where high accuracy is needed, YOLOv8n is ideal for edge applications emphasizing computational efficiency and low latency, and YOLOv8s and YOLOv5s provide balanced choices for mid-range deployments. The different strengths of these models lend flexibility to a variety of real-world applications, for which concerns such as computational power, response time, and accuracy should all be taken into account [15, 17].
5.3 Comparison of Accuracy vs Latency
Several factors affect real-time weapon detection in a surveillance system, chief among them the trade-off between accuracy and latency. Since surveillance systems usually operate in real time, the models need to balance accurate object detection against fast inference. This trade-off is especially important when deploying deep learning models on platforms of differing computational capability.
YOLOv8m, which has the best accuracy (an mAP@0.5 of 0.928), also has the highest computational intensity, leading to a lower frame rate. The model runs at around 20 frames per second (FPS), which is acceptable for server-side inference but not fast enough for real-time edge devices. Deployments that need top detection quality and can afford server-class hardware are best served by YOLOv8m. Its ability to detect even tiny and hidden objects effectively and accurately also makes it applicable to security-sensitive applications such as law enforcement surveillance and crowd monitoring, where detection performance is of paramount importance [7, 8, 22].
The smallest and most lightweight YOLOv8 model, YOLOv8n, achieves real-time inference at about 40 FPS. This high frame rate lets YOLOv8n run responsively on embedded, low-compute platforms, making it well suited to edge-computing applications where time is paramount. The speed comes at the cost of accuracy, with an mAP@0.5 of 0.872. Although less accurate, the efficiency and frugality of YOLOv8n make it an ideal option for real-time monitoring, including CCTV (closed-circuit television) cameras and drones, where the urgency of detection takes precedence over accuracy [19, 20].

YOLOv8s and YOLOv5s sit between YOLOv8m and YOLOv8n in both accuracy and latency. YOLOv8s, with an mAP@0.5 of 0.901, is a good compromise between performance and efficiency: it shows better accuracy than YOLOv8n at only slightly lower FPS, so it can be applied where a modest trade-off between speed and accuracy is acceptable. Although FPS varies with hardware, YOLOv8s delivers a considerable improvement in detection precision over YOLOv8n while maintaining adequate speed for real-time scenarios. Similarly, YOLOv5s, with an mAP@0.5 of 0.895, achieves accuracy and FPS close to YOLOv8s and is an alternative for applications that require a balance between speed and detection quality [10, 9].
The chart summarized below gives an overview of validation-set accuracy (mAP@0.5) and latency (expressed as FPS) for each model. As expected, the larger models (YOLOv8m and YOLOv8s) have better accuracy but higher latency. On the other side of the spectrum, the best latency is achieved by YOLOv8n, which is therefore selected for edge deployment where inference time plays a significant role. This comparison demonstrates the material trade-off between model complexity, accuracy, and computational cost in real-time object detection systems [13].
Summary of the accuracy-latency chart: YOLOv8n runs at about 40 FPS and supports real-time operation even on embedded systems with limited computational power (low-latency applications); YOLOv8m runs at about 20 FPS but with higher detection accuracy, suiting high-performance surveillance systems [14].
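FPS figures such as these depend heavily on hardware, but they can be estimated with a simple timing loop around the Ultralytics inference call; the sketch below uses a hypothetical test image and averages over repeated runs after a warm-up:

```python
import time
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
image = "sample.jpg"  # hypothetical test image

model.predict(image, verbose=False)  # warm-up run
n = 100
t0 = time.perf_counter()
for _ in range(n):
    model.predict(image, verbose=False)
elapsed = time.perf_counter() - t0
print(f"avg latency: {1000 * elapsed / n:.1f} ms, FPS: {n / elapsed:.1f}")
```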

CHAPTER VI
Conclusion and Future Work
6.1 Conclusion
This report presents a single benchmark for evaluating four YOLO-based architectures, YOLOv8n, YOLOv8s, YOLOv8m, and YOLOv5s, on the weapon detection task using the self-constructed Guns-Knives dataset. The evaluation was based on mAP@0.5, mAP@0.5:0.95, precision, and recall, which together estimate the accuracy and computational efficiency of the models for real-time surveillance systems. The outcomes offer a full picture of the advantages and limitations of each model, informing the choice of architecture in various situations.
The findings clearly demonstrate the following:

• YOLOv8m achieved the highest mAP@0.5 and mAP@0.5:0.95, at 0.928 and 0.671. It locates weapons with high accuracy and is best applied in server-side scenarios where computing power is not limited and high-precision detection is critical. Given its superior performance in complex environments and on small and occluded weapons, YOLOv8m is the best available model for highly effective surveillance applications in law enforcement and public safety systems [1, 4].

• YOLOv8n proved the fastest of the four, yielding around 40 FPS. That makes YOLOv8n great for embedded edge devices such as CCTV cameras or Raspberry Pi-class hardware. Although YOLOv8n is small, its detection performance remains acceptable, so it performs well in real-time scenarios demanding low latency. It is computationally efficient, applicable where computational resources are constrained, and represents a reasonable middle ground between performance and complexity [5, 14].

cally calibrating to the computational budget at deployment time and offering
a viable solution for mid-range devices where both processing power and real-
time performance matter. YOLOv8s, with an mAP@0. 5 of 0.901, YOLOv5n
has improved balance of accuracy and speed to that of YOLOv5s, which is
competent for application with relatively high accuracy but limited compu-
tational resources. [39] is aimed to develop a real-time and high-quality eye
tracking model (Referenced as ETNet). YOLOv5s, although being less accu-
rate (mAP@0. 5 = 0.895) has comparable performance and is more suitable for
real-time use in less constrained systems [6].
Inspecting the precision-recall curves, confusion matrices, and qualitative results showed that small and distorted positive samples are the major issue in weapon detection. The difficulties were particularly noticeable in identifying knives and other knife-shaped objects, where the models struggled to distinguish weapons from analogous items. Hard-negative mining schemes, which focus on finding and learning from harder examples, directly target these misclassifications. More advanced forms of augmentation, such as occlusion or lighting variation, might also help in reducing these limitations and enhancing detection accuracy further, particularly in demanding real-world scenarios [10].
The work points out the natural accuracy-versus-computational-cost trade-off that is one of the essential constraints of real-time monitoring applications, and the appropriate model depends on the deployment context. YOLOv8n offers a good trade-off between accuracy and speed where edge devices are constrained in computational resources. At the other extreme, for high-precision server-based solutions, YOLOv8m delivers the best accuracy at some computational cost. YOLOv8s and YOLOv5s provide reasonable solutions for mid-range deployments where both efficiency and precise detection are required. Overall, the work offers one of the first systematic examinations to assist in determining which version of YOLO to adopt for a given set of requirements, supporting the construction of effective and efficient weapon-detection systems in real-world surveillance applications [11, 7].

6.2 Future Work
The reported research makes important contributions, but several limitations remain, which suggest directions for future work.

• Dataset scope: The dataset covers only two classes of weapon types, so the models may generalize poorly to other weapons. Adding explosives or weapons other than guns and knives would make the models stronger and more practical in real life. A dataset balanced across diverse weapon types would enable better generalization and accuracy for the weapon varieties seen by surveillance systems [1, 2].

• Temporal modeling: The study addresses only static-image detection, thus not modeling the dynamic aspects of real-world video perception. Temporal models such as LSTM networks or 3D CNNs could improve detection of fast-moving or briefly visible weapons. Incorporating temporal relationships may enhance detection in dynamic environments where objects move in and out of view, such as crowded open spaces or a moving suspect [3, 4].
Model compression: YOLOv8n already functions well on the edge, but further edge-oriented optimizations such as quantization, pruning, and knowledge distillation are possible. These methods can alleviate the computational burden without a great decrease in accuracy. Model compression might even make stronger models such as YOLOv8m run efficiently on resource-limited embedded systems, so they could be applied directly in low-latency real-time scenarios [5, 6].
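As one hedged example of such compression, Ultralytics supports INT8 post-training quantization when exporting to embedded-friendly formats; the checkpoint and dataset YAML below are hypothetical, and the exact export options should be checked against the installed version:

```python
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # hypothetical checkpoint

# INT8 post-training quantization via TFLite export for embedded targets;
# the (hypothetical) dataset YAML supplies calibration images
model.export(format="tflite", int8=True, data="guns_knives.yaml")
```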
Multi-modal fusion: This work used only visual information, but incorporating other modalities (audio signals of gunshots, or motion-sensor data) could improve accuracy and lower false positives. Multi-modal fusion systems combining video, audio, or sensor data could improve detection robustness, particularly in noisy environments, and developing such methods could yield more reliable solutions that exploit additional cues [10, 9].
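A simple starting point is late fusion at the decision level, as in the illustrative sketch below, where a per-frame visual detection confidence is combined with the score of a separate audio gunshot classifier; the weights and threshold are hypothetical and would need tuning:

```python
def fused_alarm(visual_conf, audio_conf, w_visual=0.7, w_audio=0.3, threshold=0.6):
    """Late fusion: weighted average of per-frame visual and audio confidences."""
    score = w_visual * visual_conf + w_audio * audio_conf
    return score >= threshold, score

# A weak visual detection backed by a strong gunshot-audio cue still alarms
print(fused_alarm(visual_conf=0.55, audio_conf=0.90))  # (True, 0.655)
```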
Real-world evaluation: The dataset used for testing is a synthetic dataset rather than one drawn from real-world scenarios. Low-light situations, occlusions, and cluttered backgrounds make weapon detection genuinely difficult. Training should therefore target noisy and heterogeneous datasets that reflect real-world challenges, from which an estimate of how the models would fare under such test conditions can be obtained [12, 13].
Operational deployment: To study weapon detection as encountered in real settings, future work could deploy these models at operational sites such as airports and schools. Running the models in these environments would allow analysis of their performance under real-life conditions, including latency, power consumption, and real-time decision making [14, 11].
Interpretability: In safety-critical applications such as weapon detection, it is crucial to interpret model decisions. Although many deep learning models, such as CNNs, perform well in practice, they are often referred to as "black-box" models. Further work should therefore improve the interpretability of the models' decisions, for example with techniques such as Class Activation Mapping (CAM) or saliency maps, to guarantee that the outputs can be trusted by decision makers [7, 8].
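As a minimal illustration of the saliency-map idea (assuming a generic PyTorch classifier head rather than the full YOLO pipeline), the gradient of the top class score with respect to the input highlights the pixels that most influenced the decision:

```python
import torch

def saliency_map(model, image):
    """Return a (1, H, W) map of input-gradient magnitudes."""
    model.eval()
    image = image.clone().requires_grad_(True)  # (1, 3, H, W) tensor
    score = model(image).max()                  # top class score
    score.backward()
    return image.grad.abs().max(dim=1)[0]       # max over color channels

# Usage with a hypothetical classifier and preprocessed frame:
# sal = saliency_map(classifier, frame_tensor)
```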

References

[1] "Weapon detection in real-time CCTV videos using deep learning," IEEE Access, vol. 9, pp. 34366–34382, 2021.
[2] "... imfdb." Unpublished/venue not specified, 2017.
[3] Unpublished/venue not specified, 2019.
[4] "SSD vs Faster R-CNN for firearm identification." Unpublished/venue not specified, 2020.
[5] "... learning for weapons detection in surveillance videos," IEEE, 2021.
[6] "Gun detection in surveillance videos using deep neural networks," (Lanzhou, China), 2019.
[7] "A deep-learning framework on edge devices for handgun and knife detection from indoor surveillance cameras," Multimedia Tools and Applications, vol. 83, pp. 19109–19127, 2024.
[8] "... and handguns." Unpublished/venue not specified, 2020.
[9] "... of armed persons using CNNs," vol. 2, no. 2, 2025.
[10] "... detection in videos," 2025.
[11] "Object detection binary classifiers for small handheld objects in surveillance," Knowledge-Based Systems, 2020. Volume/pages not provided.
[12] "Handgun detection using combined human pose and weapon appearance," IEEE Access, vol. 9, pp. 123815–123830, 2021.
[13] "... detection with deep learning in video surveillance images," vol. 11, no. 13, p. 6085, 2021.
[14] "Automatic detection of weapons in surveillance cameras using EfficientNet," vol. 72, no. 3, pp. 4615–4630, 2022.
[15] "... RNN/LSTM/GRU for early fire detection." Venue/journal not specified, 2022.
[16] "Optimized YOLO for real-time flame detection." Unpublished/venue not specified, 2018.
[17] "Hybrid two-stage deep learning for fire/smoke with YOLOv5." Unpublished/venue not specified, 2021.
[18] "DeepSmoke: real-time smoke detection and segmentation," Expert Systems with Applications, 2021. Volume/pages not provided.
[19] "... detection," 2019.
[20] "Forest fire/smoke detection via transfer learning and learning without forgetting," 2023.
[21] "YOLO-based video forensics for firearms, masks, and behaviors." Unpublished/venue not specified, 2022.
[22] "Audio-based gun classification with YAMNet and spectrogram features." Unpublished/venue not specified, 2023.