“Introduction to Deep Learning and Visual AI: Fundamentals and Architectures,” a Presentation from eBay

Introduction to Deep Learning
and Visual AI: Fundamentals
and Architectures
Mohammad Haghighat
Senior Manager, CoreAI
eBay

Outline
•High-level introduction to AI
•Classical vs. deep learning
•Neural networks and deep learning
•Fully connected networks
•Elements of a neural network
•Neural network training
•Convolutional neural networks (CNNs)
•Building blocks of CNNs
•CNNs (cont.)
•Applications of CNNs
•Popular CNN architectures
•Mobile CNN architectures
•Attention mechanism
•Vision transformers
•CNN vs ViT
•Conclusions
© 2025 ebay 2

High-level introduction to AI
[Figure: a Machine Learning (ML) model maps inputs to outputs. Examples shown: the label “person”; the caption “person dancing on the beach”; “negative feedback” for the text “Nothing to love about this presentation.”; “beginning” for the misspelled “begining”; and the text “Let’s go for lunch”.]

Classical learning vs deep learning
Classical learning: Input Data (e.g., image) → Feature Extraction (e.g., edges) → Dimensionality Reduction (e.g., PCA*) → Classifier (e.g., SVM*) → Output (e.g., Dog)
Deep learning: Input Data (e.g., image) → Output
*PCA: Principal Component Analysis
*SVM: Support Vector Machine

What are neurons?

… and what are neural networks?
a layer

Neural networks as a vehicle for deep learning
Universal Approximation Theorem
A one-hidden-layer neural network with enough neurons can approximate any continuous
function to arbitrary accuracy within the given input range.
non-linear
activation function
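
To make the theorem concrete, here is a minimal sketch that is not from the talk. It builds a one-hidden-layer ReLU network by hand (no training) to approximate f(x) = x² on [0, 1]. The function names, the target function, and the hand-set weights are illustrative assumptions. Each hidden neuron adds one "kink" to a piecewise-linear curve, so more neurons give a closer fit.

```python
def relu(z):
    """Non-linear activation function."""
    return max(0.0, z)


def build_relu_net(f, n_hidden, lo=0.0, hi=1.0):
    """Return a one-hidden-layer ReLU network that matches f at evenly spaced points."""
    step = (hi - lo) / n_hidden
    knots = [lo + i * step for i in range(n_hidden)]
    # Slope of f on each segment between neighboring knots.
    slopes = [(f(lo + (i + 1) * step) - f(lo + i * step)) / step
              for i in range(n_hidden)]
    # Each neuron's output weight is the change in slope at its knot.
    weights = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n_hidden)]
    bias = f(lo)

    def net(x):
        return bias + sum(w * relu(x - k) for w, k in zip(weights, knots))

    return net


def max_error(f, net, lo=0.0, hi=1.0, samples=1000):
    """Largest absolute error over an evenly spaced grid on [lo, hi]."""
    grid = (lo + (hi - lo) * i / samples for i in range(samples + 1))
    return max(abs(f(x) - net(x)) for x in grid)


def target(x):
    return x * x


for n in (4, 32):
    print(n, "hidden neurons, max error:", max_error(target, build_relu_net(target, n)))
```

The error shrinks as neurons are added, which is the behavior the theorem describes. The theorem itself says nothing about how many neurons are needed or how to find the weights by training.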

Neural network-based classifier
Inputs: color, taste, weight, shape
Class     Network output   Ideal output
apple     0.12             0
banana    0.05             0
orange    0.83             1
The error/loss measures the difference between the network output and the ideal output.
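
The slide does not say which loss function is used. As a sketch under that caveat, the example below takes the numbers from the slide (network output 0.12, 0.05, 0.83 against ideal output 0, 0, 1) and computes two common choices, cross-entropy and mean squared error. The function names are illustrative.

```python
import math

network_output = [0.12, 0.05, 0.83]  # apple, banana, orange
ideal_output = [0.0, 0.0, 1.0]       # the correct class is "orange"


def cross_entropy(predicted, target):
    """Negative log-probability assigned to the correct class."""
    return -sum(t * math.log(p) for p, t in zip(predicted, target) if t > 0)


def mean_squared_error(predicted, target):
    """Average squared difference between prediction and target."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


ce = cross_entropy(network_output, ideal_output)
mse = mean_squared_error(network_output, ideal_output)
print(f"cross-entropy: {ce:.4f}")
print(f"mean squared error: {mse:.4f}")
```

Both values approach zero as the network output approaches the ideal output, which is what training tries to achieve.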

Neural network training
Reference
Loss and gradient descent algorithm
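
As an illustration of the idea, here is a minimal gradient descent sketch on a toy one-parameter loss, (w - 3)². The loss, the learning rate, and the step count are assumptions chosen for readability. A real network repeats the same update for every weight, with gradients computed by backpropagation.

```python
def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def gradient(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(w0, learning_rate=0.1, steps=50):
    """Repeatedly step against the gradient to reduce the loss."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w


w_final = gradient_descent(w0=0.0)
print("final w:", w_final, "final loss:", loss(w_final))
```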

Different model types and architectures
Fully Connected Networks
Convolutional Neural Networks
•Encoders
•U-Nets
•3D CNNs
Sequential Approaches
•RNNs
•LSTMs
•GRUs
Attention-based Networks
•Transformers

Image as input data
How a computer sees an edge
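
A small sketch of this idea, with made-up pixel values: to a computer an image is a grid of numbers, and an edge is a place where neighboring numbers change sharply. The helper below is illustrative and simply subtracts each pixel from its right-hand neighbor.

```python
# A tiny grayscale image: dark (0) on the left, bright (255) on the right.
image = [
    [0, 0, 0, 255, 255, 255],
    [0, 0, 0, 255, 255, 255],
    [0, 0, 0, 255, 255, 255],
    [0, 0, 0, 255, 255, 255],
]


def horizontal_differences(img):
    """Difference between each pixel and its right-hand neighbor."""
    return [[row[c + 1] - row[c] for c in range(len(row) - 1)] for row in img]


edges = horizontal_differences(image)
for row in edges:
    print(row)
```

The only non-zero column in the result marks the vertical edge between the dark and bright regions.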

Convolutional vs fully connected
Convolutional layer
●Capture local patterns and spatial
relationships between pixels
●Parameter efficiency: shared weights
●Better generalization: translation invariance
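
The sketch below illustrates shared weights and the translation behavior with a one-dimensional toy signal. The signal, the kernel, and the function name are assumptions for illustration. Strictly speaking, the convolution output shifts along with the input (equivariance), and the response becomes invariant once it is pooled over positions.

```python
def conv1d(signal, kernel):
    """Slide the same kernel weights over every position of the signal."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


kernel = [-1, 1]                       # responds to a step up or down
signal = [0, 0, 5, 5, 0, 0, 0, 0]
shifted = [0, 0, 0, 0, 5, 5, 0, 0]     # the same pattern moved right by 2

response = conv1d(signal, kernel)
shifted_response = conv1d(shifted, kernel)
print(response)
print(shifted_response)
print(max(response), max(shifted_response))  # global max pooling
```

The two responses are shifted copies of each other, and their pooled maxima are equal, using only two weights regardless of the signal length.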

Introduction to CNNs

Building blocks of CNNs

Number of parameters in a convolutional layer
Number of parameters for a K×K kernel: (K × K × N + 1) × M
N: input depth
M: output depth
The +1 counts one bias term per output channel.
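
The formula can be checked with a few lines of code. The function name and the example layer sizes are illustrative assumptions.

```python
def conv_layer_params(kernel_size, in_depth, out_depth):
    """(K*K*N + 1) * M: K*K*N weights plus one bias for each of the M filters."""
    return (kernel_size * kernel_size * in_depth + 1) * out_depth


# 3x3 kernel on an RGB image (N = 3) producing 64 feature maps (M = 64).
print(conv_layer_params(3, 3, 64))
# 3x3 kernel from 64 to 128 feature maps.
print(conv_layer_params(3, 64, 128))
```

Note that the count does not depend on the height or width of the input image.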

Building blocks of CNNs
Pooling layer
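
The slide's figure is not recoverable here, so the following is a minimal sketch of one common variant, 2×2 max pooling with stride 2, on a made-up feature map. It assumes the map has even height and width.

```python
def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 block."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 0],
    [7, 2, 9, 8],
    [0, 1, 3, 4],
]
pooled = max_pool_2x2(feature_map)
print(pooled)
```

Pooling halves the height and width here, which reduces computation in later layers and adds no trainable parameters.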

Building blocks of CNNs
A Multi-Layer CNN

Deep learning is representation learning
(a.k.a. feature learning)

Applications of CNNs
Image Classification
P_dog = 0.9
P_cat = 0.1

Applications of CNNs
Object Detection

Applications of CNNs
Instance Segmentation

Popular CNN architectures
Inception (2014)
Motivation: let the network decide what filter size to put in a layer

Popular CNN architectures
GoogLeNet (2014): Top-5 error 6.67% on ImageNet

Popular CNN architectures
Residual block with a skip connection
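
A minimal sketch of the skip connection, y = ReLU(F(x) + x). For brevity it uses small fully connected layers for F, whereas actual ResNet blocks use convolutions and batch normalization. All names and values are illustrative assumptions.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def linear(values, weights, biases):
    """Fully connected layer: one row of weights per output value."""
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, biases)]


def residual_block(x, w1, b1, w2, b2):
    """Compute ReLU(F(x) + x), where F is two layers with a ReLU in between."""
    hidden = relu(linear(x, w1, b1))
    fx = linear(hidden, w2, b2)
    return relu([f + xi for f, xi in zip(fx, x)])  # the skip connection adds x back


x = [1.0, 2.0]
zero_weights = [[0.0, 0.0], [0.0, 0.0]]
zero_biases = [0.0, 0.0]
y = residual_block(x, zero_weights, zero_biases, zero_weights, zero_biases)
print(y)
```

With all weights at zero the block still passes its input through unchanged, so each block only has to learn a correction on top of the identity. This is one common explanation for why very deep residual networks are easier to train.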

Popular CNN architectures
ResNet (2015): Top-5 error 3.57% on ImageNet for ResNet-152

Trend of CNN-based classifiers
https://paperswithcode.com

Trend of CNN-based classifiers
Comparison of popular CNN
architectures. The vertical axis
shows top-1 accuracy on
ImageNet classification. The
horizontal axis shows the number
of operations needed to classify
an image. Circle size is
proportional to the number of
parameters in the network.

CNNs for edge devices
What do we want on edge?
•Low computational complexity
•Small model size for small memory
•Low energy usage
•Good enough accuracy (depends on
application)
•Deployable on embedded
processors
•Easily updatable (over-the-air)

MobileNets

MobileNets
Regular convolution
Number of parameters for a K×K kernel: K × K × N × M (bias terms omitted)
N: input depth
M: output depth

MobileNets
Depthwise separable convolution
Number of parameters:
Depthwise:
•K × K × N
Pointwise:
•1 × 1 × N × M
Total:
•K × K × N + N × M
N: input depth
M: output depth
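
The saving can be made concrete with a short calculation. This sketch counts the pointwise 1×1 convolution as N × M weights, because it maps N input channels to M output channels, and it omits biases. The function names and the example sizes are illustrative assumptions.

```python
def regular_conv_params(k, n, m):
    """Regular convolution: every one of the M filters spans all N input channels."""
    return k * k * n * m


def depthwise_separable_params(k, n, m):
    """Depthwise (one KxK filter per input channel) plus pointwise (1x1 across channels)."""
    depthwise = k * k * n
    pointwise = 1 * 1 * n * m
    return depthwise + pointwise


k, n, m = 3, 64, 128
regular = regular_conv_params(k, n, m)
separable = depthwise_separable_params(k, n, m)
print("regular:", regular)
print("depthwise separable:", separable)
print("reduction factor:", round(regular / separable, 2))
```

For a 3×3 kernel the reduction approaches 9× as M grows, since the ratio is 1 / (1/M + 1/K²).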

MobileNets
Model shrinking hyperparameter
Depth multiplier, also called width multiplier, alpha, or α
Thins the network uniformly at each layer
Number of channels: M → αM
Log-linear dependence between accuracy and computation
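
As a rough sketch of the effect of α, the code below thins both the input and output channels of a pointwise (1×1) layer and counts its weights. The helper names and layer sizes are assumptions. Because both channel counts shrink by α, the weight count shrinks by roughly α².

```python
def thin(channels, alpha):
    """Apply the multiplier to a channel count, keeping at least one channel."""
    return max(1, int(channels * alpha))


def pointwise_params(in_channels, out_channels, alpha=1.0):
    """Weights in a 1x1 convolution after thinning both sides by alpha."""
    return thin(in_channels, alpha) * thin(out_channels, alpha)


for alpha in (1.0, 0.75, 0.5, 0.25):
    print(alpha, pointwise_params(64, 128, alpha))
```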

EfficientNets
Let’s uniformly scale network width, depth, and resolution with a set of fixed scaling coefficients
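
A sketch of this compound scaling, using the base coefficients reported in the EfficientNet paper listed under Resources (depth 1.2, width 1.1, resolution 1.15). The slide itself does not give these numbers, and the function names are illustrative. A single coefficient φ raises all three bases to the same power.

```python
# Base coefficients as reported in the EfficientNet paper (an assumption here,
# since the slide does not state them).
DEPTH_BASE, WIDTH_BASE, RESOLUTION_BASE = 1.2, 1.1, 1.15


def compound_scaling(phi):
    """Multipliers for depth, width, and input resolution at scaling coefficient phi."""
    return {
        "depth": DEPTH_BASE ** phi,
        "width": WIDTH_BASE ** phi,
        "resolution": RESOLUTION_BASE ** phi,
    }


def flops_multiplier(phi):
    """Approximate compute growth: linear in depth, quadratic in width and resolution."""
    return (DEPTH_BASE * WIDTH_BASE ** 2 * RESOLUTION_BASE ** 2) ** phi


for phi in range(4):
    print(phi, compound_scaling(phi), round(flops_multiplier(phi), 2))
```

The bases are chosen so that each increment of φ roughly doubles the computation.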

The power of attention
•A mathematical mechanism that weighs the significance of each part of the input against all other parts in the input
•Training allows the model to learn how to calculate relevance between input parts based on the contextual content
•Removes the inductive biases we have placed on CNNs
[Figure: Input and Self-Attention]
Source: Tom Michiels, Synopsys, Embedded Vision Summit 2022
Source: Dosovitskiy et al., An Image is Worth 16x16 Words, ICLR 2021
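
To show how each part of the input is weighed against all other parts, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification: real transformers first apply learned projections to obtain queries, keys, and values, while this example uses the raw token vectors for all three. The token values are made up.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """For each query, average the values, weighted by query-key similarity."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        all_weights.append(weights)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs, all_weights


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # three input parts, two features each
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights shows how strongly one input part attends to every part. The first token gives more weight to the similar second token than to the dissimilar third one.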

High-level overview of the ViT
Source: Dosovitskiy et al., An Image is Worth 16x16 Words, ICLR 2021

What’s the catch?
•There are open challenges…
•Requires huge datasets to train (these are large-data regime models)
•Computation and memory requirements increase quadratically with the number of input parts
•Still computationally too expensive for edge inference*
* Transformer models with parameter sizes between 5 and 100 M, and computational requirements between 2 and 16 GFLOPs already exist. Source: https://arxiv.org/pdf/2101.01169.pdf
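
The quadratic growth can be sketched with simple arithmetic. This assumes a ViT-style model that splits a square image into 16×16 patches, as in the paper title cited in this deck, and it ignores extras such as a class token. The function names are illustrative.

```python
def num_patches(image_size, patch_size=16):
    """Number of input parts (patches) for a square image."""
    return (image_size // patch_size) ** 2


def attention_entries(image_size, patch_size=16):
    """Every patch attends to every patch, so the attention matrix has n * n entries."""
    n = num_patches(image_size, patch_size)
    return n * n


for size in (224, 448):
    print(size, num_patches(size), attention_entries(size))
```

Doubling the image side length gives 4 times as many patches and 16 times as many attention entries per layer and head.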

CNNs vs. transformers
CNNs
Advantages
•Efficiency
•Spatial hierarchy
•Established frameworks
Disadvantages
•Limited context
•Sensitivity to transformations (e.g., rotation)
Transformers
Advantages
•Global context
•Scalability: do better with more data and larger size
Disadvantages
•Data hungry
•Computationally intensive

What type of model should I use?
•Compare and contrast the features of CNNs and transformers, such as:
•Input data representation (entire image vs patches)
•Local features vs global features
•Parameter efficiency (CNNs can achieve good performance with fewer parameters)
•Training data requirements
•Computational efficiency and memory requirements
•Interpretability (which one is easier to interpret? CNNs are generally considered easier)

Conclusions
We talked about:
•Deep neural networks and CNNs as the network of choice for computer vision
•The building blocks of CNNs: Convolution layer, pooling layer, padding, stride, etc.
•Application of CNNs in computer vision: Image classification, object detection,
segmentation, etc.
•CNN architectures: Inception, GoogLeNet, ResNet
•Edge-optimized CNN architectures: MobileNets and EfficientNets
•Attention mechanism and ViTs
Choosing the right model for an application and target hardware is crucial
for accuracy and efficiency.

Resources
•EfficientNet: https://arxiv.org/abs/1905.11946
•Papers With Code: https://paperswithcode.com
•Understanding of MobileNet: https://wikidocs.net/165429
•New mobile neural network architectures: https://machinethink.net/blog/mobile-architectures/
•An Analysis of Deep Neural Network Models for Practical Applications: https://arxiv.org/abs/1605.07678
•Deep Learning Equivariance and Invariance: https://www.doc.ic.ac.uk/~bkainz/teaching/DL/notes/equivariance.pdf
•IndoML Student Notes: Convolutional Neural Networks (CNN) Introduction: https://indoml.com/2018/03/07/student-notes-convolutional-neural-networks-cnn-introduction/
•Beginners Guide to Convolutional Neural Networks: https://towardsdatascience.com/beginners-guide-to-understanding-convolutional-neural-networks-ae9ed58bb17d
•A Comprehensive Guide to Convolutional Neural Networks: https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53
•Dosovitskiy et al., An Image is Worth 16x16 Words, ICLR 2021
•Tom Michiels, Synopsys, Embedded Vision Summit 2022
