Machine Learning Based Object Recognition Using CIFAR-10


About This Presentation

Machine learning based object recognition using CIFAR-10.


Slide Content

MACHINE LEARNING BASED OBJECT RECOGNITION USING CIFAR-10
Made by: 2020UEC2528 MAYANK DEWAN, 2020UEC2539 RUDRAKSH SHARMA, 2020UEC2559 GAURAV RAJORA
Under the supervision of: Dr. Satya Prakash Singh

CONTENTS
1. Introduction
2. Literature Survey
3. Gaps Identified
4. Objectives Of The Project
5. Problem Statement
6. Solution Methodology
7. Results
8. Future Work To Be Done
9. Expected Outcomes Of The Work

Introduction

Preschool education, also known as early childhood education or kindergarten, is typically the first stage of formal education that children receive. Its primary focus is on preparing young learners for a smooth transition into more structured and formal education. It holds immense significance in a child's development, providing a strong foundation for lifelong learning and success.

Key aspects of education in kindergarten include:

Introduction to Academic Basics: Kindergarten introduces children to basic academic concepts such as letters, numbers, shapes, and colours. It aims to familiarize them with the fundamental building blocks of reading, writing, and math.

Language and Literacy Development: Kindergarteners are exposed to language development activities that include vocabulary building, listening skills, and basic phonics. They may begin to recognize and write letters and words.

Social and Emotional Skills: Kindergarten is a significant stage for the development of social and emotional skills. Children learn how to interact with their peers, share, take turns, and express their feelings in appropriate ways.

Machine learning can be applied in various ways to enhance and support preschool learning, benefiting both educators and young learners. One such application is image classification, which can help young learners recognize and learn about objects such as scales, pencils, balls, newspapers, books, animals, and pets. User-friendly interfaces and colourful visuals can aid learning and make it interactive and easy.

Object Recognition is a computer vision task that involves identifying and classifying objects within an image or a video stream. The goal of object recognition is to determine what objects are present in the scene and assign them to predefined categories or classes. It outputs the names or labels of the objects present in the image or video.

Object Detection, on the other hand, is a related computer vision task that goes beyond object recognition. Object detection not only identifies the objects within an image or video but also provides information about their location or spatial extent, outputting both the object labels and the coordinates of bounding boxes around each detected object.

Image classification using the CIFAR-10 dataset is a common computer vision task. CIFAR-10 contains 60,000 32x32 colour images in 10 classes, with 6,000 images per class. The objective is to train a machine learning or deep learning model to classify these images into their respective categories.

Convolutional Neural Networks (CNNs) are a powerful and widely used architecture for image classification on CIFAR-10. CNNs automatically learn to extract hierarchical features from images: their convolutional layers detect edges, textures, and more complex patterns, which is essential for recognizing objects in CIFAR-10. Lower layers detect basic features like edges, while higher layers combine these features to recognize more complex objects or patterns. Weight sharing in convolutional layers significantly reduces the number of parameters in the model, making it feasible to train deep networks without an impractical number of parameters. CNNs also exhibit translation invariance, meaning they can recognize patterns regardless of their position in the image, which is crucial for classifying objects that can appear anywhere in a picture. Finally, data augmentation techniques, such as image rotation, scaling, and flipping, can easily be applied to image data, increasing the diversity of the training dataset and improving model generalization, as shown in the sketch below.
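As a concrete illustration, here is a minimal sketch of loading CIFAR-10 and applying the augmentations just mentioned. The library (tensorflow.keras) and the augmentation factors are assumptions for illustration, not the presenters' exact code.

```python
import tensorflow as tf
from tensorflow.keras.datasets import cifar10

# CIFAR-10: 60,000 32x32 RGB images in 10 classes (50,000 train / 10,000 test).
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

# Data augmentation: rotation, zoom (scaling), and horizontal flips
# increase training-set diversity and improve generalization.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomRotation(0.05),   # illustrative factors
    tf.keras.layers.RandomZoom(0.1),
    tf.keras.layers.RandomFlip("horizontal"),
])
```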

Literature Survey

A large number of articles and research papers, including IEEE papers, were studied to gather information on convolutional neural network techniques. Different image recognition models and libraries were also surveyed on Kaggle to decide on the model best suited to us. Datasets such as CIFAR-10 and CIFAR-100, along with architectures such as VGG16, were studied thoroughly to weigh their advantages and disadvantages against one another. Different CNN models were considered, and ResNet-50 was selected along with CIFAR-10.

MODELS REFERRED

1. "Learning Multiple Layers of Features from Tiny Images"
Author: Alex Krizhevsky
Description: This work introduced the CIFAR-10 dataset and presented a deep convolutional neural network (CNN) that achieved state-of-the-art performance at the time. It laid the foundation for using deep learning in image classification.
Abstract: The work primarily focuses on using deep CNNs for image classification and introduces the CIFAR-10 dataset as a benchmark for evaluating these models. The CIFAR-10 dataset comprises 60,000 32x32 colour images in ten classes, making it a suitable choice for testing deep learning architectures.

2. "Very Deep Convolutional Networks for Large-Scale Image Recognition"  Authors: Karen Simonyan, Andrew Zisserman Description: This paper introduced the VGGNet architecture, which significantly deepened the networks used for image classification. It has had a major influence on the design of deep CNNs for CIFAR-10 and other datasets. Abstract: The paper focuses on the development of deep convolutional neural network (CNN) architectures and their application to large-scale image recognition. The authors present a network called VGG (Visual Geometry Group)Net, characterized by its depth and uniformity, which makes it effective for various computer vision tasks, including image classification. 3."Residual Networks"  Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, et al. Description: This paper introduced ResNet, a deep neural network architecture with residual connections. ResNet revolutionized image classification by enabling the training of very deep networks, leading to improved performance on CIFAR-10 and other datasets. Abstract: The paper addresses the problem of training very deep neural networks, which are prone to vanishing or exploding gradients during training. To mitigate these issues, the authors propose the use of residual connections, which allow for the training of networks with hundreds or even thousands of layers. The resulting architecture is called a Residual Network, or ResNet.

Gaps Identified

Preschool education plays a critical role in a child's early development and lays the foundation for future academic and social success. However, like any educational system, preschooling has gaps or shortcomings that can be identified and analysed. Here are some common gaps and shortcomings in preschool education:

Classroom Size: The student-to-teacher ratio is a critical factor in the quality of early education. Statistics can show whether classrooms are overcrowded, which can negatively impact individualized attention and learning outcomes.

Quality of Teaching: Many preschools lack trained teachers and an age-appropriate curriculum, leading to variations in the educational experience provided to young children.

Access Disparities: Many children, especially those from disadvantaged backgrounds, lack access to quality preschool programs. Limited availability of preschools and financial barriers can result in unequal access to early education.

Curriculum Alignment: The curriculum in some preschool programs may not be developmentally appropriate or aligned with best practices in early childhood education, impacting children's readiness for kindergarten.

There is also a lack of a proper ML model focused on improving elementary education through object classification using datasets.

Objective Of The Project

Problem Statement

Preschool education plays a crucial role in the cognitive, social, and emotional development of young children. However, it often faces challenges related to accessibility, quality, and individualized learning experiences. An adequate number of teachers and innovative teaching methodology have also become difficult to find. According to various reports, the gross enrolment ratio (GER) for preschool education in India is low, with significant disparities between urban and rural areas. The quality of preschool education is a major concern: many preschools lack trained teachers and an age-appropriate curriculum, leading to variations in the educational experience provided to young children. Insufficient infrastructure, including the lack of proper facilities, materials, and play equipment in preschools, affects the quality of education provided. Teachers are not able to focus on every child due to lack of time and the increased number of children in a class. There is therefore a need for innovation, and to address these issues there is growing interest in leveraging technology, specifically machine learning, to enhance and personalize preschool education.

Methodology

Understanding the CNN

A CNN (Convolutional Neural Network) is a type of neural network. Neural networks are a subset of machine learning, and they are at the heart of deep learning algorithms. They are comprised of node layers: an input layer, one or more hidden layers, and an output layer. Each node connects to others and has an associated weight and threshold.

a) Convolution Layer: The convolutional layer is the core building block of a CNN, and it is where most of the computation occurs. It requires a few components: input data, a filter, and a feature map. Assume the input is a colour image, which is made up of a matrix of pixels in 3D. This means the input has three dimensions (height, width, and depth) corresponding to the RGB channels of the image. We also have a feature detector, also known as a kernel or filter, which moves across the receptive fields of the image [2], checking whether the feature is present. This process is known as a convolution. The final output of the series of dot products between the input and the filter is known as a feature map, activation map, or convolved feature.

The weights in the feature detector remain fixed as it moves across the image, which is known as parameter sharing. A parameter sharing scheme is used in convolutional layers to control the number of free parameters. It relies on the assumption that if a patch feature is useful to compute at some spatial position, then it should also be useful to compute at other positions. Denoting a single 2-dimensional slice of depth as a depth slice, the neurons in each depth slice are constrained to use the same weights and bias, as the sketch below makes concrete.
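To make parameter sharing concrete, the following sketch (tensorflow.keras assumed) counts the parameters of a single convolutional layer: the count depends only on the filter size and depth, not on the image size, because every spatial position reuses the same weights.

```python
import tensorflow as tf

# One convolutional layer: 32 filters of size 3x3 over a 3-channel input.
conv = tf.keras.layers.Conv2D(filters=32, kernel_size=3)
conv.build((None, 32, 32, 3))  # build for 32x32 RGB inputs

# Parameters = 3*3*3 weights per filter * 32 filters + 32 biases = 896,
# regardless of image size: each depth slice shares one weight set and bias.
print(conv.count_params())  # 896
```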

Sometimes the parameter sharing assumption may not make sense. This is especially the case when the input images to a CNN have some specific centred structure, for which we expect completely different features to be learned at different spatial locations. One practical example is when the inputs are faces that have been centred in the image: we might expect different eye-specific or hair-specific features to be learned in different parts of the image. In that case it is common to relax the parameter sharing scheme and instead simply call the layer a "locally connected layer".

Three hyperparameters control the size of the output:

1. Number of filters, which affects the depth of the output; for example, three distinct filters produce three different feature maps.

2. Stride, the distance (in pixels) the kernel moves over the input matrix. While stride values of two or greater are rare, a larger stride yields a smaller output.

3. Zero padding, used when the filter does not fit the input image. This sets all elements that fall outside of the input matrix to zero, producing a larger or equally sized output. Common types of padding include valid padding (also known as no padding), in which the last convolution is dropped if dimensions do not align, and same padding, which ensures that the output layer has the same size as the input layer.

After each convolution operation, a CNN applies a Rectified Linear Unit (ReLU) transformation to the feature map, introducing nonlinearity into the model. (Figure: image convolution with a filter.) The sketch below shows how stride and padding affect output size.
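A minimal sketch (tensorflow.keras assumed) of how these choices change the output shape:

```python
import tensorflow as tf

x = tf.random.normal((1, 32, 32, 3))  # a single 32x32 RGB input

valid   = tf.keras.layers.Conv2D(8, 3, strides=1, padding="valid")(x)
same    = tf.keras.layers.Conv2D(8, 3, strides=1, padding="same")(x)
strided = tf.keras.layers.Conv2D(8, 3, strides=2, padding="same")(x)

print(valid.shape)    # (1, 30, 30, 8): valid padding drops the borders
print(same.shape)     # (1, 32, 32, 8): same padding keeps the input size
print(strided.shape)  # (1, 16, 16, 8): a larger stride yields a smaller output

relu_out = tf.nn.relu(same)  # ReLU after the convolution adds nonlinearity
```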

Illustration of how a convolutional layer operates. The ReLU activation function makes the network converge much faster because it does not saturate (when x > 0) and is computationally efficient. It is defined as f(x) = max(0, x), but it suffers from a drawback: when x < 0, neurons remain inactive during the forward pass and their weights are not updated during backpropagation, so the network does not learn. Hence we use leaky ReLU [8], which is defined as f(x) = x for x > 0 and f(x) = αx for x ≤ 0, where α is a small positive slope (e.g. 0.01).
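A tiny numerical sketch of the two activations (α = 0.01 is an illustrative slope, not a value given in the presentation):

```python
import tensorflow as tf

x = tf.constant([-2.0, -0.5, 0.0, 1.5])
print(tf.nn.relu(x).numpy())                    # [ 0.    0.    0.   1.5]: negatives are zeroed
print(tf.nn.leaky_relu(x, alpha=0.01).numpy())  # [-0.02 -0.005 0.   1.5]: a small gradient survives
```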

b) Pooling Layer: The pooling layer is used to reduce the size of the image while keeping the important parameters in play, thus helping to reduce the computation in the model. Pooling is done in two ways: max pooling or average pooling.

1. Max Pooling: Max pooling narrows down the scope so that, of all the features, only the most important are considered. Max pooling is generally used [5]. It considers only the maximum value within a kernel filter, as per the stride size, and enhances edge detection (i.e., vertical and horizontal edges). It reduces the image size and makes computation faster. However, the max pooling technique is not used for large datasets, as it reduces the image size significantly and thus changes the image parameters.

2. Average Pooling: This is a technique for reducing the size of the feature maps produced by the convolutional layers in a convolutional neural network (CNN). It works by dividing the feature map into non-overlapping regions of a fixed size, usually 2x2 or 3x3, and taking the average value of the pixels within each region as the output. In this way, average pooling summarizes the average presence of a feature in a region of the feature map and discards less relevant information. Moreover, average pooling provides some degree of translation invariance to the CNN, meaning that the output does not change much if the input image is slightly shifted or rotated. Both pooling variants are sketched below.
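A minimal sketch contrasting the two pooling variants on a single 4x4 feature map (2x2 windows, stride 2; tensorflow.keras assumed):

```python
import tensorflow as tf

fmap = tf.constant([[1., 3., 2., 1.],
                    [4., 6., 5., 2.],
                    [1., 2., 9., 7.],
                    [0., 1., 8., 6.]])
fmap = tf.reshape(fmap, (1, 4, 4, 1))  # (batch, height, width, channels)

max_out = tf.keras.layers.MaxPooling2D(pool_size=2)(fmap)      # strongest response per region
avg_out = tf.keras.layers.AveragePooling2D(pool_size=2)(fmap)  # average presence per region

print(tf.squeeze(max_out).numpy())  # [[6. 5.] [2. 9.]]
print(tf.squeeze(avg_out).numpy())  # [[3.5 2.5] [1.  7.5]]
```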

c) Fully Connected Layer: A fully connected layer in a convolutional neural network (CNN) is a layer in which each neuron is connected to every neuron in the previous layer. This means the input of a fully connected layer is a vector containing all the values from the output of the previous layer. A fully connected layer performs a linear transformation on the input vector, followed by a non-linear activation function such as sigmoid, tanh, or ReLU.

To illustrate how a fully connected layer works, consider an example of a CNN that classifies images into 10 categories. Suppose the input image has a size of 32 x 32 x 3 (width x height x depth) and passes through several convolutional and pooling layers that produce an output of size 4 x 4 x 16. This output is then flattened into a vector of size 256 (4 x 4 x 16), which is the input of the first fully connected layer. The first fully connected layer has 64 neurons, so it has a weight matrix of size 256 x 64 and a bias vector of size 64. The output of this layer is another vector of size 64, obtained by multiplying the input vector by the weight matrix, adding the bias vector, and applying an activation function. This vector is then fed into the second fully connected layer, which has 10 neurons corresponding to the 10 classes, with a weight matrix of size 64 x 10 and a bias vector of size 10. Its output is a vector of size 10 representing the class scores or probabilities for each class. The final output of the network is obtained by applying a softmax function to this vector, which normalizes it to sum to one.

In short, each neuron applies a linear transformation to the input vector through a weight matrix, and a non-linear activation function f is then applied to the result. This worked example is written out as code below.
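Written as a minimal keras model, the worked example above (a 4x4x16 feature map flattened to 256 values, a 64-neuron layer, then a 10-neuron softmax layer) looks like this:

```python
import tensorflow as tf

head = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4, 4, 16)),
    tf.keras.layers.Flatten(),                        # 4*4*16 = 256 inputs
    tf.keras.layers.Dense(64, activation="relu"),     # 256x64 weights + 64 biases = 16,448
    tf.keras.layers.Dense(10, activation="softmax"),  # 64x10 weights + 10 biases = 650
])
head.summary()  # the softmax output sums to one across the 10 class scores
```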

This illustrates how a CNN recognizes an object (here, a koala) while detecting the object's different features. A feature map is made through the convolution layer with a grid of filters, then passed to more sophisticated filters to detect more general features of the object. Afterwards, the 3D convolutional output is flattened into a 1D array to feed a fully connected dense neural network. That dense network helps detect a variety of objects with similar features, classifying them in a more generic way. ReLU is used to make the features more non-linear, making the model more general and helping to handle overfitting. This is how the CNN model functions.

Training Model

Convolutional neural networks (CNNs) are a type of deep learning model that use convolution operations rather than standard matrix multiplication. They have frequently been employed in classification problems in recent years, particularly in image recognition, and can automatically extract discriminative features through the training process. ResNet-50 is based on a deep residual learning framework that allows very deep networks with hundreds of layers to be trained. ResNet-50 consists of 50 layers divided into 5 blocks, each containing a set of residual blocks. The residual blocks preserve information from earlier layers, which helps the network learn better representations of the input data.

(Figure: architecture of the model.)
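A minimal sketch of the residual-block idea (simplified; not the exact ResNet-50 bottleneck block, and the layer sizes are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """y = F(x) + x: the skip connection preserves earlier-layer information.
    Assumes x already has `filters` channels so the addition is valid."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])  # add the input back before the final ReLU
    return layers.ReLU()(y)
```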

Architecture of ResNet-50

The 50-layer ResNet architecture includes the following elements:

- A 7×7 convolution with 64 kernels and a stride of 2.
- A max pooling layer with a stride of 2.
- 9 more layers: a 3×3, 64-kernel convolution, another with 1×1, 64 kernels, and a third with 1×1, 256 kernels; these 3 layers are repeated 3 times.
- 12 more layers with 1×1, 128 kernels; 3×3, 128 kernels; and 1×1, 512 kernels, iterated 4 times.
- 18 more layers with 1×1, 256 kernels; 3×3, 256 kernels; and 1×1, 1024 kernels, iterated 6 times.
- 9 more layers with 1×1, 512 kernels; 3×3, 512 kernels; and 1×1, 2048 kernels, iterated 3 times. (Up to this point the network has 50 layers.)
- Average pooling, followed by a fully connected layer with 1000 nodes, using the softmax activation function.
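A hedged sketch of adapting a pre-trained ResNet-50 backbone to CIFAR-10 with tensorflow.keras; the presentation does not state its exact configuration, so the input resizing, ImageNet weights, optimizer, and classification head below are assumptions:

```python
import tensorflow as tf

base = tf.keras.applications.ResNet50(include_top=False,
                                      weights="imagenet", pooling="avg")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 3)),
    tf.keras.layers.Resizing(224, 224),  # ResNet-50 expects larger inputs than 32x32
    # preprocess_input expects raw 0-255 pixels (skip /255 scaling on this path)
    tf.keras.layers.Lambda(tf.keras.applications.resnet50.preprocess_input),
    base,
    tf.keras.layers.Dense(10, activation="softmax"),  # one output per CIFAR-10 class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # integer labels, as in cifar10.load_data()
              metrics=["accuracy"])
```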

Results

The aim of the proposed method is to recognise objects accurately, with the least loss and maximum accuracy. In light of this, we measured accuracy and loss. The accuracy indicator gives the percentage of successfully predicted observations out of all observations: Accuracy = (correct predictions / total predictions) × 100%. Fetching all the losses from 10 epochs, we find the final loss and obtain the accuracy. The observation is that the pre-trained model, when trained with a plain neural network, gave an accuracy of around 31.6%, and when the pre-trained model is trained with ResNet-50, the accuracy increases to 71.7%. An illustrative training and evaluation call is sketched below.
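A training and evaluation sketch matching the 10 epochs on the slide; `model`, `x_train`, `y_train`, `x_test`, and `y_test` are the hypothetical objects from the earlier sketches:

```python
history = model.fit(x_train, y_train, epochs=10,
                    validation_data=(x_test, y_test))

# Accuracy = correct predictions / total predictions.
loss, acc = model.evaluate(x_test, y_test)
print(f"final loss: {loss:.3f}, accuracy: {acc:.1%}")  # slide reports ~71.7% with ResNet-50
```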

(Fig. 1: Accuracy after 10 epochs. Fig. 2: Loss after 10 epochs.)

Fig. 1 shows the accuracy during the training and validation process. Fig. 2 shows the loss during the training and validation process; the loss function denotes the degree of error in the predictions made during training and testing. Together, Figs. 1 and 2 show the model's performance in the form of accuracy and loss with respect to iterations, demonstrating the model's accuracy after 10 iterations. After each iteration, the training performance progressively improved and remained stable. On the validation set for the classification of objects (CIFAR-10 dataset), an average accuracy of 71% was attained after 10 iterations (epochs). With each iteration the loss likewise decreased, and the entire model took 41 seconds to predict the labels correctly. The curves in Figs. 1-2 can be reproduced from the training history as sketched below.
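A minimal plotting sketch using the keras `history` object returned by the hypothetical `model.fit` call above:

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(history.history["accuracy"], label="train")  # Fig. 1: accuracy curves
ax1.plot(history.history["val_accuracy"], label="validation")
ax1.set_title("Accuracy after 10 epochs")

ax2.plot(history.history["loss"], label="train")      # Fig. 2: loss curves
ax2.plot(history.history["val_loss"], label="validation")
ax2.set_title("Loss after 10 epochs")

ax1.legend(); ax2.legend()
plt.show()
```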

Future Work To Be Done

THANK YOU