Explore the groundbreaking project by Agin Anuradha that harnesses the power of artificial intelligence to generate descriptive captions for images. This presentation delves into the technology and methodologies behind the image caption generator, demonstrating its potential to enhance accessibility, automate content creation, and improve user engagement. Discover how this innovative tool combines computer vision and natural language processing to accurately interpret and describe visual content, and learn about its practical applications and future possibilities. Join us in understanding how this project is setting new standards in visual content interpretation. For more information, visit https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Size: 970.59 KB
Language: en
Added: Sep 13, 2024
Slides: 16
Slide Content
Image Caption Generator
Introduction Objective: To build a model that generates captions for images using a combination of deep learning and attention mechanisms. Key Points: Image feature extraction with VGG16 (pre-trained model). Caption generation using encoder-decoder architecture with attention. Evaluation using BLEU scores.
Dataset Dataset: Flickr8k Dataset Images: 8000 images of various scenes. Captions: 5 captions per image. Source: Link to the dataset. Preprocessing: Images resized to 224x224. Captions cleaned and tokenized.
VGG16 Feature Extraction: Using VGG16 model to extract image features. Last classification layer removed. Output: 4096-dimension feature vector for each image.
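The feature-extraction step described above can be sketched in Keras roughly as follows; the directory layout and the helper name extract_features are illustrative assumptions, while the 224x224 resize and the 4096-dimension fc2 output match the slide.

```python
# Minimal sketch of VGG16 feature extraction, assuming the standard Keras
# applications API; paths and helper names are illustrative.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

# Load VGG16 and drop the final classification layer, keeping the 4096-d fc2 output.
base = VGG16(weights="imagenet")
extractor = Model(inputs=base.input, outputs=base.layers[-2].output)

def extract_features(image_path):
    """Return a 4096-dimension feature vector for one image."""
    img = load_img(image_path, target_size=(224, 224))  # resize to 224x224
    x = img_to_array(img)
    x = np.expand_dims(x, axis=0)                       # add a batch dimension
    x = preprocess_input(x)                             # VGG16-specific preprocessing
    return extractor.predict(x, verbose=0)[0]           # shape: (4096,)
```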
Text Preprocessing Caption Cleaning: Convert to lowercase. Remove non-alphabetic characters. Add startseq and endseq tokens. Tokenization: Convert text to sequences using Tokenizer. Vocabulary size calculated. Maximum caption length determined.
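A minimal sketch of this cleaning and tokenization step, assuming the raw captions are held in a dict named captions mapping each image id to its five caption strings (the variable names are illustrative):

```python
# Caption cleaning and tokenization sketch; `captions` is an assumed dict of
# image id -> list of raw caption strings.
import string
from tensorflow.keras.preprocessing.text import Tokenizer

def clean_caption(caption):
    # Lowercase, keep alphabetic words only, and wrap with startseq/endseq tokens.
    words = caption.lower().split()
    words = [w.translate(str.maketrans("", "", string.punctuation)) for w in words]
    words = [w for w in words if w.isalpha()]
    return "startseq " + " ".join(words) + " endseq"

all_captions = [clean_caption(c) for caps in captions.values() for c in caps]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_captions)
vocab_size = len(tokenizer.word_index) + 1              # +1 for the padding index 0
max_length = max(len(c.split()) for c in all_captions)  # longest caption in tokens
```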
Data Generator Functionality: Generates training data batches. Outputs pairs of image features and tokenized captions. Reduces memory usage by avoiding loading all data at once.
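A sketch of such a generator, assuming the features dict (image id to 4096-d vector), the captions dict, and the tokenizer, max_length, and vocab_size from the previous steps; it yields one batch of (image feature, caption prefix) pairs and the corresponding next-word targets at a time:

```python
# Batch generator sketch: builds (prefix -> next word) training pairs on the fly
# instead of materialising the whole dataset in memory.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def data_generator(captions, features, tokenizer, max_length, vocab_size, batch_size=32):
    X1, X2, y = [], [], []
    while True:                                          # loop forever; Keras stops per epoch
        for image_id, caps in captions.items():
            for cap in caps:
                seq = tokenizer.texts_to_sequences([cap])[0]
                # Each caption yields several (prefix, next word) training examples.
                for i in range(1, len(seq)):
                    in_seq = pad_sequences([seq[:i]], maxlen=max_length)[0]
                    out_word = to_categorical([seq[i]], num_classes=vocab_size)[0]
                    X1.append(features[image_id])
                    X2.append(in_seq)
                    y.append(out_word)
                    if len(X1) == batch_size:
                        yield (np.array(X1), np.array(X2)), np.array(y)
                        X1, X2, y = [], [], []
```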
Model Architecture Encoder-Decoder with Attention: Encoder: Processes image features using Dense and LSTM layers. Attention Mechanism: Aligns image features with corresponding words in the caption. Decoder: Generates the next word in the caption using LSTM.
Attention Mechanism Functionality: Focuses on relevant parts of the image when generating each word. Attention scores are calculated using the Dot layer. Importance: Helps the model align visual features with the corresponding text more effectively.
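The two slides above can be approximated with the Keras sketch below. It is not the author's exact layer graph: the 256-unit layer sizes, the dropout rate, and the reshaping of the 4096-d VGG16 vector into eight pseudo-regions (so the Dot-layer attention has more than one feature to weigh) are assumptions; only the 4096-d input, the LSTM decoder, and the Dot-based attention scores come from the slides.

```python
# Encoder-decoder with dot-product attention: a minimal sketch, not the exact
# architecture used in the project. Layer sizes are illustrative.
from tensorflow.keras.layers import (Input, Dense, LSTM, Embedding, Dropout,
                                     Dot, Activation, Concatenate, Reshape)
from tensorflow.keras.models import Model

def build_model(vocab_size, max_length, units=256):
    # Encoder: split the 4096-d image vector into 8 pseudo-regions and project them.
    image_input = Input(shape=(4096,))
    img = Reshape((8, 512))(image_input)
    img = Dense(units, activation="relu")(img)          # (batch, 8, units)

    # Decoder input: the caption prefix, embedded and encoded with an LSTM.
    text_input = Input(shape=(max_length,))
    txt = Embedding(vocab_size, units)(text_input)
    txt = LSTM(units, return_sequences=True)(txt)       # (batch, max_length, units)

    # Attention: dot-product scores between each decoder step and each image region.
    scores = Dot(axes=(2, 2))([txt, img])               # (batch, max_length, 8)
    weights = Activation("softmax")(scores)             # normalise over the 8 regions
    context = Dot(axes=(2, 1))([weights, img])          # (batch, max_length, units)

    # Combine the attended image context with the decoder states, predict next word.
    merged = Concatenate()([context, txt])
    merged = LSTM(units)(merged)
    merged = Dropout(0.5)(merged)
    output = Dense(vocab_size, activation="softmax")(merged)

    return Model(inputs=[image_input, text_input], outputs=output)
```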
Model Training Training Setup: Epochs: 50 Batch Size: 32 Loss Function: Categorical Crossentropy Optimizer: Adam Validation: Split the dataset (90% training, 10% validation).
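Wiring the earlier sketches together with the stated settings (50 epochs, batch size 32, categorical cross-entropy, Adam, 90/10 split) might look like this; train_captions and val_captions are assumed to be the two halves of the caption dict after the split:

```python
# Training sketch using the generator and model builder from the earlier sketches.
model = build_model(vocab_size, max_length)
model.compile(loss="categorical_crossentropy", optimizer="adam")

batch_size = 32
epochs = 50
# Rough step counts: the exact number depends on how many (prefix, next-word)
# pairs the generator emits per caption.
steps_per_epoch = max(1, len(train_captions) // batch_size)
validation_steps = max(1, len(val_captions) // batch_size)

model.fit(
    data_generator(train_captions, features, tokenizer, max_length, vocab_size, batch_size),
    steps_per_epoch=steps_per_epoch,
    epochs=epochs,
    validation_data=data_generator(val_captions, features, tokenizer, max_length, vocab_size, batch_size),
    validation_steps=validation_steps,
)
```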
Results: Caption Generation
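At inference time a caption is typically built one word at a time, feeding the growing prefix back into the model until the endseq token is produced. A greedy-decoding sketch (the helper names are illustrative, not taken from the project code):

```python
# Greedy caption generation sketch for the trained model.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_feature, max_length):
    index_to_word = {i: w for w, i in tokenizer.word_index.items()}
    caption = "startseq"
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([caption])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        # Predict the next word from the image feature and the caption so far.
        yhat = model.predict([np.array([photo_feature]), seq], verbose=0)
        word = index_to_word.get(int(np.argmax(yhat)))
        if word is None or word == "endseq":
            break
        caption += " " + word
    return caption.replace("startseq", "").strip()
```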
Evaluation with BLEU Scores Evaluation Metric: BLEU-1: Measures unigram precision. BLEU-2: Measures bigram precision. Scores:
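A sketch of the BLEU-1/BLEU-2 computation using NLTK's corpus_bleu, assuming a held-out test_captions dict and the generate_caption helper from the previous sketch:

```python
# BLEU evaluation sketch: compare generated captions against the 5 references per image.
from nltk.translate.bleu_score import corpus_bleu

references, hypotheses = [], []
for image_id, caps in test_captions.items():
    pred = generate_caption(model, tokenizer, features[image_id], max_length)
    references.append([c.split() for c in caps])        # reference captions, tokenized
    hypotheses.append(pred.split())

print("BLEU-1: %.4f" % corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0)))
print("BLEU-2: %.4f" % corpus_bleu(references, hypotheses, weights=(0.5, 0.5, 0, 0)))
```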
Challenges Challenges Encountered: Handling long captions with complex dependencies. Attention model tuning. BLEU score sensitivity to short captions.
Deployment on Streamlit
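A minimal Streamlit sketch of such a deployment; the artifact file names (model.h5, tokenizer.pkl), the hard-coded max_length, and the reuse of the earlier helper functions are assumptions:

```python
# Streamlit app sketch: upload an image, extract features, and display a caption.
import pickle
import streamlit as st
from tensorflow.keras.models import load_model

st.title("Image Caption Generator")

# Illustrative artifact names; the trained model and fitted tokenizer are assumed
# to have been saved after training.
model = load_model("model.h5")
with open("tokenizer.pkl", "rb") as f:
    tokenizer = pickle.load(f)
max_length = 35  # illustrative; use the value computed during preprocessing

uploaded = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])
if uploaded is not None:
    st.image(uploaded)
    # Assumes the extractor accepts a file-like object; otherwise save the upload
    # to a temporary file first and pass its path.
    feature = extract_features(uploaded)
    caption = generate_caption(model, tokenizer, feature, max_length)
    st.write("Caption:", caption)
```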
Conclusion Key Conclusion: Successfully built an image captioning model using VGG16 and LSTM with attention. Achieved meaningful results as evaluated by BLEU scores. Future Work: Fine-tuning: Experiment with fine-tuning the captioning model architecture and hyperparameters for improved performance. Dataset Expansion: Incorporate additional datasets to increase the diversity and complexity of the trained model; for example, the model could be trained on the Flickr30k dataset.