“Introduction to Deep Learning and Visual AI: Fundamentals and Architectures,” a Presentation from eBay

embeddedvision 54 views 43 slides Sep 15, 2025

About This Presentation

For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2025/09/introduction-to-deep-learning-and-visual-ai-fundamentals-and-architectures-a-presentation-from-ebay/

Mohammad Haghighat, Senior Manager for CoreAI at eBay, presents the “Introduction to Deep Learning and ...


Slide Content

Introduction to Deep Learning
and Visual AI: Fundamentals
and Architectures
Mohammad Haghighat
Senior Manager, CoreAI
eBay

Outline
•High-level introduction to AI
•Classical vs. deep learning
•Neural networks and deep learning
•Fully connected networks
•Elements of a neural network
•Neural network training
•Convolutional neural networks (CNNs)
•Building blocks of CNNs
•CNNs (cont.)
•Applications of CNNs
•Popular CNN architectures
•Mobile CNN architectures
•Attention mechanism
•Vision transformers
•CNN vs ViT
•Conclusions
© 2025 ebay 2

High-level introduction to AI
[Figure: examples of Machine Learning (ML) models mapping an input to an output]
•Image → ML model → "person"
•Image → ML model → "person dancing on the beach"
•"Nothing to love about this presentation." → ML model → negative feedback
•ML model with the labels "beginning" and "begining"
•"Let's go for lunch" → ML model

Classical learning vs deep learning
Classical learning: Input Data (e.g., image) → Feature Extraction (e.g., edges) → Dimensionality Reduction (e.g., PCA*) → Classifier (e.g., SVM*) → Output (Dog)
Deep learning: Input Data (e.g., image) → [neural network diagram] → Output
*PCA: Principal Component Analysis
*SVM: Support Vector Machines

What are neurons?

… and what are neural networks?
a layer

Neural networks as a vehicle for deep learning
Universal Approximation Theorem
A one-hidden-layer neural network with enough neurons can approximate any continuous
function, to any desired accuracy, within the given input range.
non-linear
activation function

Neural network-based classifier
Input features: color, taste, weight, shape
Class     Network output   Ideal output
apple     0.12             0
banana    0.05             0
orange    0.83             1
error/loss: the gap between the network output and the ideal output
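The slide does not say which loss function is used. As a minimal sketch, assuming the common cross-entropy loss for classification, the error between the network output (0.12, 0.05, 0.83) and the ideal one-hot output (0, 0, 1) could be computed like this. The function name `cross_entropy` is illustrative, not taken from the presentation.

```python
import math

def cross_entropy(predicted, ideal):
    """Cross-entropy between predicted probabilities and a one-hot target."""
    # Only classes with a non-zero target contribute to the sum.
    return -sum(t * math.log(p) for p, t in zip(predicted, ideal) if t > 0)

classes = ["apple", "banana", "orange"]
network_output = [0.12, 0.05, 0.83]  # values from the slide
ideal_output = [0, 0, 1]             # one-hot label: the fruit is an orange

loss = cross_entropy(network_output, ideal_output)
predicted_class = classes[network_output.index(max(network_output))]
print(f"predicted={predicted_class}, loss={loss:.4f}")
```

The loss shrinks toward zero as the probability assigned to the correct class approaches 1.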

Neural network training
Reference
Loss and gradient descent algorithm
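To make the idea of gradient descent concrete, here is a minimal sketch on a toy one-parameter loss, L(w) = (w - 3)^2. The loss, learning rate, and step count are assumptions chosen for illustration; real network training applies the same update rule to millions of weights, with gradients obtained by backpropagation.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to reduce the loss."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w

# Toy loss L(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_final)  # approaches the minimizer w = 3
```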

Different model types and architectures
Fully Connected Networks
Convolutional Neural Networks
•Encoders
•UNETs
•3D CNNs
Sequential Approaches
•RNNs
•LSTMs
•GRUs
Attention-based Networks
•Transformers

Image as input data
How a computer sees an edge
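As a rough illustration of this slide, an image is just a grid of numbers, and an edge shows up as a large jump between neighboring pixel values. The pixel values and the helper `horizontal_differences` below are made up for this sketch.

```python
def horizontal_differences(image):
    """Difference between each pixel and its right-hand neighbor."""
    return [[row[x + 1] - row[x] for x in range(len(row) - 1)] for row in image]

# A tiny grayscale image: dark region (10) on the left, bright region (200) on the right.
image = [
    [10, 10, 10, 200, 200, 200],
    [10, 10, 10, 200, 200, 200],
    [10, 10, 10, 200, 200, 200],
]

edges = horizontal_differences(image)
for row in edges:
    print(row)  # the large value marks the vertical edge
```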

Convolutional vs fully connected
Convolutional layer
●Capture local patterns and spatial
relationships between pixels
●Parameter efficiency: shared weights
●Better generalization: translation invariance

Introduction to CNNs

Building blocks of CNNs

Number of parameters in a convolutional layer
Number of parameters for a K×K kernel:
(K × K × N + 1) × M
N: input depth
M: output depth
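The formula above can be checked with a short sketch. The fully connected comparison and the 32×32 image size are assumptions added for illustration, to show why shared weights make convolutional layers parameter-efficient.

```python
def conv_params(k, n, m):
    """Parameters of a conv layer: M filters of size K x K x N, plus one bias each."""
    return (k * k * n + 1) * m

def fully_connected_params(inputs, outputs):
    """Parameters of a dense layer: one weight per input plus a bias, for every output."""
    return (inputs + 1) * outputs

# 3x3 kernels, RGB input (N = 3), 64 output channels (M = 64).
print(conv_params(3, 3, 64))

# Hypothetical dense layer mapping a 32x32x3 image to a 32x32x64 output.
print(fully_connected_params(32 * 32 * 3, 32 * 32 * 64))
```

The convolutional count does not depend on the image size, while the dense count grows with both the input and the output resolution.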

Building blocks of CNNs
Pooling layer
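The slide does not specify the pooling type. A minimal sketch, assuming 2×2 max pooling with stride 2 (a common choice) and a made-up 4×4 feature map:

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2; assumes even height and width."""
    height, width = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[y][x], feature_map[y][x + 1],
                feature_map[y + 1][x], feature_map[y + 1][x + 1],
            )
            for x in range(0, width - 1, 2)
        ]
        for y in range(0, height - 1, 2)
    ]

feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]
pooled = max_pool_2x2(feature_map)
print(pooled)  # each output value is the maximum of one 2x2 block
```

Pooling halves the height and width here, which reduces computation in later layers and keeps only the strongest response in each neighborhood.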

Building blocks of CNNs
A Multi-Layer CNN

Deep learning is representation learning
(a.k.a. feature learning)

Applications of CNNs
Image Classification
P(dog) = 0.9
P(cat) = 0.1
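Class probabilities like these are typically produced by a softmax over the network's final scores. The sketch below assumes softmax is used and picks hypothetical scores (logits) so that the result matches the slide's 0.9 and 0.1.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for (dog, cat), chosen to reproduce the slide's probabilities.
logits = [math.log(9.0), 0.0]
p_dog, p_cat = softmax(logits)
print(round(p_dog, 3), round(p_cat, 3))
```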

Applications of CNNs
Object Detection

Applications of CNNs
Instance Segmentation

Popular CNN architectures
Inception (2014)
Motivation: let the network decide what filter size to put in a layer

Popular CNN architectures
GoogLeNet (2014) - Top-5 error 6.67% on ImageNet

Popular CNN architectures
Residual block with a skip connection
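A skip connection adds the block's input back to its output, so the block computes y = F(x) + x. The sketch below uses a made-up transform for F and plain Python lists in place of tensors; in a real ResNet block, F is a stack of convolutions, and further details such as the activation applied after the addition are omitted here.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def residual_block(x, transform):
    """Return F(x) + x, where `transform` plays the role of F."""
    fx = transform(x)
    return [a + b for a, b in zip(x, fx)]

def toy_transform(x):
    # Hypothetical stand-in for the convolutional layers inside the block.
    return relu([0.5 * v - 1.0 for v in x])

x = [1.0, 4.0, -2.0]
y = residual_block(x, toy_transform)
print(y)
```

If F outputs zeros, the block passes its input through unchanged, which is one reason very deep residual networks remain trainable.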

Popular CNN architectures
ResNet (2015) - Top-5 error 3.57% on ImageNet for ResNet-152

Trend of CNN-based classifiers
https://paperswithcode.com

Trend of CNN-based classifiers
[Figure] Comparison of popular CNN architectures. The vertical axis shows top-1 accuracy on ImageNet classification. The horizontal axis shows the number of operations needed to classify an image. Circle size is proportional to the number of parameters in the network.

CNNs for edge devices
What do we want on edge?
•Low computational complexity
•Small model size for small memory
•Low energy usage
•Good enough accuracy (depends on
application)
•Deployable on embedded
processors
•Easily updatable (over-the-air)

MobileNets

MobileNets
Regular convolution
Number of parameters
for a K×K kernel:
K ×K ×N ×M
N: input depth
M: output depth

MobileNets
Depthwise separable
convolution
Number of parameters:
Depthwise:
•K ×K ×N
Pointwise:
•1 ×1 ×N ×M
Total:
•K ×K ×N + N ×M
N: input depth
M: output depth
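The saving can be made concrete with a short sketch. It counts weights only (biases ignored) and assumes the pointwise stage uses M filters of size 1×1×N, as in the MobileNet design. The layer sizes are made up for illustration.

```python
def regular_conv_params(k, n, m):
    """M filters, each of size K x K x N (biases ignored)."""
    return k * k * n * m

def depthwise_separable_params(k, n, m):
    """One K x K filter per input channel, then M pointwise filters of size 1 x 1 x N."""
    depthwise = k * k * n
    pointwise = n * m
    return depthwise + pointwise

k, n, m = 3, 64, 128  # hypothetical layer: 3x3 kernels, 64 input and 128 output channels
regular = regular_conv_params(k, n, m)
separable = depthwise_separable_params(k, n, m)
print(regular, separable, round(regular / separable, 2))
```

For 3×3 kernels the reduction approaches a factor of 9 as the number of output channels grows.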

MobileNets
Model shrinking hyperparameter
Also called: depth multiplier, width multiplier, alpha (α)
Thins a network uniformly at each layer
Number of channels: M → αM
Log-linear dependence between accuracy and computation
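As a sketch of how α shrinks a layer, the code below scales both the input and output channel counts of one depthwise separable layer (K×K×N depthwise weights plus N×M pointwise weights, biases ignored). The channel counts and the rounding rule in `thin_channels` are assumptions for illustration.

```python
def thin_channels(channels, alpha):
    """Apply the width multiplier: M -> alpha * M (rounded, at least 1 channel)."""
    return max(1, int(round(channels * alpha)))

def depthwise_separable_params(k, n, m):
    """K x K x N depthwise weights plus N x M pointwise weights."""
    return k * k * n + n * m

for alpha in (1.0, 0.75, 0.5, 0.25):
    n = thin_channels(64, alpha)
    m = thin_channels(128, alpha)
    print(alpha, n, m, depthwise_separable_params(3, n, m))
```

Because both N and M shrink, the pointwise term falls roughly with α squared.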

EfficientNets
Let’s uniformly scale network width, depth, and resolution with a set of fixed scaling coefficients
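The slide does not list the coefficients. The sketch below uses the values reported in the EfficientNet paper (listed under Resources): depth scales as α^φ, width as β^φ, and resolution as γ^φ, with α = 1.2, β = 1.1, γ = 1.15, chosen so that α·β²·γ² is close to 2. The function name `compound_scale` is illustrative.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # coefficients reported in the EfficientNet paper

def compound_scale(phi):
    """Multipliers for depth, width, and resolution, plus the approximate FLOPs growth."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    flops = (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi
    return depth, width, resolution, flops

for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(phi, round(d, 2), round(w, 2), round(r, 2), round(f, 2))
```

Each increment of φ roughly doubles the computation while keeping the three dimensions in balance.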

EfficientNets
Note: the baseline B0 architecture is
designed using neural architecture
search (NAS).

The power of attention
•A mathematical mechanism that weighs the significance of each part of the input against all other parts in the input
•Training allows the model to learn how to calculate relevance between input parts based on the contextual content
•Removes the inductive biases we have placed on CNNs
Source: Tom Michiels, Synopsys, Embedded Vision Summit 2022
[Figure: two panels labeled "Input" and "Self-Attention"]
Source: Dosovitskiy et al., An Image is Worth 16x16 Words, ICLR 2021
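The mechanism described on this slide is commonly implemented as scaled dot-product attention. The sketch below is a simplified version that assumes the queries, keys, and values are the input tokens themselves; a real transformer first applies learned projection matrices, which is where training comes in. The token values are made up.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Each output is a weighted average of all values, weighted by query-key similarity."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)  # relevance of every input part to this one
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy input parts
attended = self_attention(tokens, tokens, tokens)
for row in attended:
    print([round(v, 3) for v in row])
```

Every token is compared with every other token, so no locality assumption is built in, unlike a convolution.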

A more generalized learning algorithm

High-level overview of the ViT
[Figure: ViT overview diagram; recoverable labels: "information", "Input"]
Source: Dosovitskiy et al., An Image is Worth 16x16 Words, ICLR 2021
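The first step of a ViT, as the cited paper's title suggests, is to cut the image into fixed-size patches that are then treated like words in a sentence. A minimal sketch of that step, using a made-up 4×4 single-channel image and 2×2 patches, and assuming the image size is divisible by the patch size:

```python
def split_into_patches(image, patch):
    """Cut a 2D image into non-overlapping patch x patch blocks, each flattened to a list."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([
                image[y][x]
                for y in range(top, top + patch)
                for x in range(left, left + patch)
            ])
    return patches

image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
patches = split_into_patches(image, 2)
print(patches)

# A 224x224 image with 16x16 patches gives (224 // 16) ** 2 input tokens.
num_tokens = (224 // 16) ** 2
print(num_tokens)
```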

What’s the catch?
•There are open challenges…
•Requires huge datasets to train (these are large-data regime models)
•Computation and memory requirements increase quadratically with the number of input parts
•Still computationally too expensive for edge inference*
* Transformer models with parameter sizes between 5 and 100 M, and computational requirements between 2 and 16 GFLOPs already exist. Source: https://arxiv.org/pdf/2101.01169.pdf
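The quadratic growth can be seen by counting entries in the attention matrix, which holds one score per pair of input parts. The sketch assumes 16×16 patches and square images; the function name is illustrative.

```python
def attention_cost(image_size, patch_size=16):
    """Number of tokens and number of pairwise attention scores for a square image."""
    tokens = (image_size // patch_size) ** 2
    return tokens, tokens * tokens

for size in (224, 448, 896):
    tokens, scores = attention_cost(size)
    print(size, tokens, scores)
```

Doubling the image side length quadruples the number of tokens and multiplies the number of attention scores by 16.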

CNNs vs. transformers
Advantages of CNNs:
•Efficiency
•Spatial hierarchy
•Established frameworks
Advantages of transformers:
•Global context
•Scalability: do better with more data and larger size
Disadvantages of CNNs:
•Limited context
•Sensitivity to geometric transformations (e.g., rotation)
Disadvantages of transformers:
•Data hungry
•Computationally intensive

What type of model should I use?
•Compare and contrast the features of CNNs and transformers, such as:
•Input data representation (entire image vs. patches)
•Local features vs. global features
•Parameter efficiency (CNNs can achieve good performance with fewer parameters)
•Training data requirements
•Computational efficiency and memory requirements
•Interpretability (which one is easier to interpret? CNNs are thought to be easier)

Conclusions
We talked about:
•Deep neural networks and CNNs as the network of choice for computer vision
•The building blocks of CNNs: Convolution layer, pooling layer, padding, stride, etc.
•Application of CNNs in computer vision: Image classification, object detection,
segmentation, etc.
•CNN architectures: Inception, GoogLeNet, ResNet
•Edge-optimized CNN architectures: MobileNets & EfficientNets
•Attention mechanism and ViTs
Choosing the right model for an application and target hardware is crucial
for accuracy and efficiency.

Any questions?
dog: 97%

Resources
•EfficientNet: https://arxiv.org/abs/1905.11946
•Papers With Code: https://paperswithcode.com
•Understanding of MobileNet: https://wikidocs.net/165429
•New mobile neural network architectures: https://machinethink.net/blog/mobile-architectures/
•An Analysis of Deep Neural Network Models for Practical Applications: https://arxiv.org/abs/1605.07678
•Deep Learning Equivariance and Invariance: https://www.doc.ic.ac.uk/~bkainz/teaching/DL/notes/equivariance.pdf
•IndoML Student Notes: Convolutional Neural Networks (CNN) Introduction: https://indoml.com/2018/03/07/student-notes-convolutional-neural-networks-cnn-introduction/
•Beginners Guide to Convolutional Neural Networks: https://towardsdatascience.com/beginners-guide-to-understanding-convolutional-neural-networks-ae9ed58bb17d
•A Comprehensive Guide to Convolutional Neural Networks: https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53
•Dosovitskiy et al., An Image is Worth 16x16 Words, ICLR 2021
•Tom Michiels, Synopsys, Embedded Vision Summit 2022