INDUSTRIAL TRAINING
Submitted in partial fulfilment of the requirements for the award of the degree of Bachelor of Technology in Computer Science & Engineering
By: DAMANPREET KAUR (05913202719 / CSE1 / 2019)
Department of Computer Science & Engineering
Guru Tegh Bahadur Institute of Technology
Guru Gobind Singh Indraprastha University
INTRODUCTION

What is Machine Learning?
Machine learning is a family of computer algorithms that learn from examples and improve with experience, without being explicitly programmed for each task. It is a part of artificial intelligence that combines data with statistical tools to predict an output, which can then be turned into actionable insights.

What is Deep Learning?
Deep learning is a branch of machine learning, which is itself a subset of artificial intelligence. Deep learning models are built from neural networks that loosely imitate the human brain. Nothing is programmed explicitly; instead, the model uses many layers of nonlinear processing units to perform feature extraction and transformation, with each successive layer taking the output of the preceding layer as its input.
CNN (Convolutional Neural Network)
A CNN is a deep learning algorithm that takes an input image and assigns importance (learnable weights and biases) to various aspects/objects in the image, which helps it differentiate one image from another. One of the most popular applications of this architecture is image classification. The network consists of several convolutional layers mixed with nonlinear and pooling layers. After this series of convolutional, nonlinear and pooling layers, a fully connected layer is attached; it takes the output of the convolutional stack and produces the final prediction.

CNN-LSTM ARCHITECTURE
The CNN-LSTM architecture uses CNN layers for feature extraction on the input data, combined with LSTM layers to support sequence prediction. This model is specifically designed for sequence prediction problems with spatial inputs, such as images or videos. CNN-LSTMs are widely used in activity recognition, image description, video description and many other applications.
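To make the convolution and pooling operations concrete, here is a minimal pure-Python sketch of what a single convolutional layer followed by max pooling computes. The input "image" and the kernel values are made-up illustrative numbers, not taken from the project:

```python
# Minimal sketch of convolution + max pooling on a tiny grayscale "image"
# (a 2D list of pixel values). Real CNNs learn the kernel weights.
def convolve2d(image, kernel):
    """Valid-mode 2D convolution (cross-correlation, as CNN layers use)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling: downsamples by keeping local maxima."""
    return [[max(feature_map[i + a][j + b]
                 for a in range(size) for b in range(size))
             for j in range(0, len(feature_map[0]) - size + 1, size)]
            for i in range(0, len(feature_map) - size + 1, size)]

image = [[1, 2, 0, 1],
         [0, 1, 3, 1],
         [2, 1, 0, 2],
         [1, 0, 1, 3]]
edge_kernel = [[1, -1]]  # a 1x2 kernel responding to horizontal changes
fmap = convolve2d(image, edge_kernel)   # 4x3 feature map
pooled = max_pool2d(fmap, 2)            # 2x1 downsampled map
```

Stacking many such layers (with learned kernels and nonlinear activations in between) is what lets a CNN build up progressively more abstract image features.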
INTRODUCTION TO PROJECT
Image caption generation is the process of recognizing the context of an image and annotating it with relevant captions using deep learning and computer vision. It involves labelling an image with English keywords with the help of the datasets provided during model training. The goal of image captioning is to convert a given input image into a natural language description. The task can be logically divided into two modules:
Image-based model — extracts the features of the image.
Language-based model — translates the features and objects extracted by the image-based model into a natural sentence.
For the image-based model we use a CNN, and for the language-based model we use an LSTM.
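The two-module split described above can be sketched with the Keras Functional API. This is an illustrative sketch only: the 2048-dimensional image feature size, the vocabulary size of 7579, the maximum caption length of 34 and the 256-unit layer widths are assumed values for demonstration, not figures from the report:

```python
# Hypothetical sketch of the image-based + language-based merge model.
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout, add
from tensorflow.keras.models import Model

vocab_size = 7579   # assumption: size of the caption vocabulary
max_length = 34     # assumption: length of the longest caption

# Image-based model: a precomputed CNN feature vector, compressed by a dense layer
inputs1 = Input(shape=(2048,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)

# Language-based model: the caption-so-far, embedded and run through an LSTM
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)

# Decoder: merge both representations and predict the next word
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```

At inference time the model is called repeatedly: each predicted word is appended to the caption-so-far and fed back in until the end marker is produced.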
WORKFLOW OF THE PROJECT
1. Perform data cleaning
This step cleans the data by taking all descriptions as input. While dealing with textual data we need to perform several types of cleaning, including uppercase-to-lowercase conversion, punctuation removal, and removal of words containing numbers.
2. Loading the dataset for model training
We have used a Kaggle Flickr dataset, which contains 8000 images with 5 captions for each image. The captions for every image in the list of photos are stored in a dictionary. To make it easy for the LSTM model to identify the beginning and ending of a caption, we append a start identifier and an end identifier to each caption.
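The cleaning and marker-wrapping steps above can be sketched as follows. The marker names 'startseq' and 'endseq' are assumptions (common in caption-generation tutorials), not names taken from the report:

```python
import string

def clean_caption(caption):
    """Lowercase the caption, strip punctuation, and drop any token
    that contains digits or other non-alphabetic characters."""
    table = str.maketrans('', '', string.punctuation)
    words = caption.lower().translate(table).split()
    words = [w for w in words if w.isalpha()]
    return ' '.join(words)

def wrap_caption(caption):
    """Append start/end markers so the LSTM can detect caption boundaries.
    ('startseq' / 'endseq' are assumed marker names.)"""
    return 'startseq ' + clean_caption(caption) + ' endseq'

example = wrap_caption('A dog, 2 dogs running!')
```

Running `clean_caption` over all 5 captions of each of the 8000 images yields the cleaned dictionary used in the next step.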
3. Tokenizing the vocabulary
Machines are not familiar with complex English words, so to process the model's data they need a simple numerical representation. That is why we map every word of the vocabulary to a separate unique index value.
4. Define the CNN-LSTM model
Using the Keras Functional API, we define the structure of the model. It includes:
Feature extractor — a dense layer that compresses the extracted image features into a fixed-length vector.
Sequence processor — an embedding layer that handles the textual input, followed by an LSTM layer.
Decoder — the outputs of the two layers above are merged and passed through a dense layer to make the final prediction.
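The word-to-index mapping of the tokenizing step can be sketched in plain Python (in practice Keras' Tokenizer does this; the helper names here are hypothetical):

```python
def build_word_index(captions):
    """Map every vocabulary word to a unique integer index.
    Index 0 is reserved for padding, as Keras' Tokenizer does."""
    vocab = sorted({word for cap in captions for word in cap.split()})
    return {word: i + 1 for i, word in enumerate(vocab)}

def encode(caption, word_index):
    """Replace each word with its index so the model sees only numbers."""
    return [word_index[w] for w in caption.split()]

captions = ['startseq a dog runs endseq', 'startseq a cat sleeps endseq']
word_index = build_word_index(captions)
encoded = encode(captions[0], word_index)
```

During training, each encoded caption is split into (prefix, next word) pairs, and the prefixes are zero-padded to the maximum caption length before being fed to the sequence processor.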
CONCLUSION
The project "Caption an Image" has been developed as per the requirement specification. It has been developed using machine learning libraries such as TensorFlow and Pandas, and techniques such as LSTM networks and regression. The complete system has been thoroughly tested against the available data, with throughput reports prepared manually; these are found to be more accurate because information is available from various levels. The design is flexible, so new modules can be incorporated easily.
FUTURE SCOPE OF PROJECT
Visually impaired users: This project can help visually impaired people by generating captions for photographs of their surroundings, helping them understand their environment better.
Social media: The model could be integrated into social media sites to power new features such as automatic captions and filters. A tourist could simply scan an item, receive a generated caption, and research it much more easily. It can be used in daily-life situations easily and effectively.
NLP applications: Furthermore, this model can be used in various applications of Natural Language Processing; for example, in digital image processing it can be used as a component for researching and training larger models.