Enhancing Vision Models for Fine-Grained Classification

About This Presentation

Led a machine learning project focused on image classification using advanced vision models, including EfficientNet, MobileNet, and Vision Transformers.
• Created custom datasets for dog breeds and food categories to evaluate model performance. Implemented and fine-tuned pretrained models, ac...


Slide Content

Enhancing Vision Models for Fine-Grained Classification
Presented by: Mohd Shahvez
Supervisor: Dr. Karm Veer Singh

2 Introduction
Description: This work explores the capabilities and performance of several vision models in classifying images, focusing on dog breeds and food categories. It provides a comparative analysis of EfficientNet, Vision Transformer, and MobileNet on these classification tasks, modifies the Vision Transformer's positional encoding, and evaluates the performance of the modified model.
Objectives:
• Performance evaluation: compare vision models on the Dog Vision and Food Vision datasets.
• Impact analysis: assess the effectiveness of rotary embeddings in the Vision Transformer.
• Insights: identify the strengths and weaknesses of each model to guide future research and applications.
Gaps in current research: fine-grained classification; small-scale datasets.

3 Models Used:

4 Dog Breed Dataset: A comprehensive dataset of 20,000 images spanning 150 different dog breeds (150 classes), built to provide a robust test for fine-grained classification tasks.
Food Vision Dataset: A smaller, more focused dataset of 100 images across 3 classes, used to evaluate model performance in limited-data scenarios.
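For context, a minimal sketch of how a folder-per-class dataset like these could be loaded in a PyTorch/torchvision setup; the dog_vision/train and dog_vision/test directory layout is a hypothetical placeholder, since the slides do not specify how the images are organized.

from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Standard ImageNet-style preprocessing so the images match the
# input size and normalization expected by the pretrained backbones.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Hypothetical layout: dog_vision/train/<breed_name>/*.jpg
train_data = datasets.ImageFolder("dog_vision/train", transform=transform)
test_data = datasets.ImageFolder("dog_vision/test", transform=transform)

train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
test_loader = DataLoader(test_data, batch_size=32, shuffle=False)

print(len(train_data.classes))  # expected: 150 breed classes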

5 Approach
Transfer learning: Models pretrained on ImageNet were fine-tuned on our custom datasets to adapt the learned features to the specific tasks (a rough sketch of this step follows below).
Rotary positional embeddings: We replaced the absolute positional embeddings in the Vision Transformer with rotary positional embeddings to enhance the model's ability to capture fine-grained details.
Evaluation metrics: train/test loss, train/test accuracy.
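As a rough illustration of the transfer-learning step, the sketch below fine-tunes a pretrained torchvision EfficientNet-B0 on a 150-class dataset (matching Dog Vision) and reports train loss and accuracy. The optimizer settings and the choice to freeze the backbone are assumptions for the sketch, not the project's actual training configuration.

import torch
from torch import nn
from torchvision import models

NUM_CLASSES = 150  # Dog Vision; would be 3 for Food Vision

# Load ImageNet-pretrained weights and swap in a new classification head.
model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, NUM_CLASSES)

# Freeze the backbone so only the new head is trained
# (an assumed choice; full fine-tuning is also common).
for param in model.features.parameters():
    param.requires_grad = False

optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_one_epoch(loader, device="cpu"):
    """Run one training epoch and return (train loss, train accuracy)."""
    model.train()
    correct, total, running_loss = 0, 0, 0.0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        logits = model(images)
        loss = loss_fn(logits, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * labels.size(0)
        correct += (logits.argmax(dim=1) == labels).sum().item()
        total += labels.size(0)
    return running_loss / total, correct / total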

6 Dog Vision Results
Performance overview: This slide presents the performance results of the various models on the Dog Vision dataset.
Conclusion:
• The Vision Transformer with rotary embeddings outperforms the other models in both training and test accuracy.
• EfficientNet models offer a practical trade-off between performance and resource usage, suitable for deployment on devices with limited computational power.
• MobileNet, while efficient, shows lower accuracy, indicating a need for further optimization or use in less complex tasks.

7 Food Vision Results
Performance overview: This slide presents the performance results of the various models on the Food Vision dataset.
Conclusion:
• The Vision Transformer excels in both training and test accuracy, making it a strong candidate for food image classification tasks.
• EfficientNet models provide a balanced approach, suitable for applications where computational resources are a concern.
• MobileNet, despite its efficiency, may require further optimization for tasks requiring higher accuracy.

8 Predictions from the best model
• Vision Transformer on the Dog Vision dataset
• Vision Transformer on the Food Vision dataset
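A single-image prediction with the fine-tuned model might look like the sketch below; the image path and class-name list are placeholders, and transform is assumed to be the same preprocessing used at training time.

import torch
from PIL import Image

def predict(model, image_path, class_names, transform, device="cpu"):
    """Return the predicted class name and its softmax probability."""
    model.eval()
    image = transform(Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)
    with torch.no_grad():
        probs = torch.softmax(model(image), dim=1)
    conf, idx = probs.max(dim=1)
    return class_names[idx.item()], conf.item()

# Example usage (placeholder path):
# label, confidence = predict(model, "samples/husky.jpg", train_data.classes, transform)
# print(f"{label}: {confidence:.2%}")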

9 Explanation of Rotary Embeddings
Definition: Rotary embeddings are a type of positional encoding that enhances the self-attention mechanism in Vision Transformers.
Mechanism: They encode relative position information, allowing the model to better capture spatial relationships in images.
Effect on the Vision Transformer's performance:
• Improved accuracy: on the Dog Vision dataset, test accuracy improved from ~84% to ~89%.
• Reduced test loss: significantly lower test loss, indicating better model generalization.
Benefits:
• Enhanced spatial understanding: rotary embeddings improve the model's ability to understand and process fine-grained spatial details.
• Consistency: consistently high performance across different datasets and tasks.
(Charts: impact of rotary embeddings on accuracy and loss.)
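To make the mechanism concrete, the sketch below shows the standard rotary position embedding (RoPE) operation applied to queries and keys over flattened patch positions. It is a generic illustration of the technique (Su et al., RoFormer), not the project's exact code; in particular, how 2D patch positions are handled here is an assumption.

import torch

def rotary_embed(x, base=10000.0):
    """Apply rotary position embeddings to a tensor of shape
    (batch, heads, seq_len, head_dim); head_dim must be even."""
    b, h, n, d = x.shape
    # Per-pair rotation frequency: theta_i = base^(-2i/d)
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    pos = torch.arange(n, dtype=torch.float32)
    angles = torch.einsum("n,f->nf", pos, inv_freq)   # (n, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]               # split dims into pairs
    # Rotate each (x1, x2) pair by its position-dependent angle.
    rot1 = x1 * cos - x2 * sin
    rot2 = x1 * sin + x2 * cos
    return torch.stack((rot1, rot2), dim=-1).reshape(b, h, n, d)

# Inside attention, queries and keys are rotated before the dot product,
# so the attention score depends on the relative patch positions:
# q, k: (batch, heads, num_patches, head_dim)
# attn = (rotary_embed(q) @ rotary_embed(k).transpose(-2, -1)) / head_dim ** 0.5

Because the rotation angle grows with position, the dot product between a rotated query and key depends only on their positional offset, which is what gives the model its improved sense of relative spatial structure.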

10 Future Work
Expand dataset scope:
• Additional classes: incorporate more classes into both datasets to test model scalability and robustness.
• Diverse images: include more diverse and challenging images to further evaluate model performance.
Model enhancements:
• Advanced architectures: experiment with other state-of-the-art architectures and hybrid models.
• Optimized training: explore techniques for reducing training time and computational cost without compromising accuracy.
Real-world applications:
• Deployment: implement the models in real-world applications such as mobile apps and automated systems.
• User interaction: test model performance with real-time user interaction and feedback to refine the models further.
Additional research:
• Fine-grained tasks: investigate the performance of Vision Transformers with rotary embeddings on other fine-grained classification tasks.
• Efficiency optimization: focus on reducing the computational requirements of Vision Transformers for more efficient deployment.

11 References
Dosovitskiy, A., et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770-778.
Tan, M., & Le, Q. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. International Conference on Machine Learning (ICML), 6105-6114.
Howard, A. G., et al. (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv preprint arXiv:1704.04861.
Vaswani, A., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS), 5998-6008.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv preprint arXiv:2103.14030.
Touvron, H., et al. (2021). Training Data-Efficient Image Transformers & Distillation Through Attention. Proceedings of the 38th International Conference on Machine Learning (ICML), 10347-10357.
Ramachandran, P., et al. (2019). Stand-Alone Self-Attention in Vision Models. Advances in Neural Information Processing Systems (NeurIPS), 68-80.

12