TrOCR: Transformer-Based Optical Character Recognition with Pre-trained Models

About This Presentation

Paper review on TrOCR


Slide Content

TrOCR: Transformer-Based Optical Character Recognition with Pre-trained Models
Authors: Minghao Li, Tengchao Lv, Jingye Chen, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei
Presented by: KORAT Natt
Lecturer: Dr. KONG Phutphalla

Table of Contents
1. Introduction
2. Related Works
3. Model Architecture
4. Experiments and Evaluations
5. Conclusion

1. Introduction
Optical Character Recognition (OCR) is the mechanical conversion of images of typed, handwritten, or printed text into machine-readable text.

1. Introduction
[Figure: where TrOCR fits — text recognition, as opposed to detection models such as YOLO (object detection) and DBNet (text detection)]

1. Introduction
The contributions of this paper are:
- An end-to-end Transformer-based OCR model for text recognition built on pre-trained CV and NLP models.
- State-of-the-art results with a standard Transformer-based encoder-decoder model.
- Models and code are publicly available: https://github.com/microsoft/unilm/tree/master/trocr (a minimal inference sketch follows).
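Because the pre-trained checkpoints are public, recognition can be run in a few lines. A minimal inference sketch using the Hugging Face transformers library; the checkpoint name microsoft/trocr-base-handwritten and the image path are assumptions, not stated on the slides:

    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    # Assumed checkpoint; printed and scene-text variants are also released.
    processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
    model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

    # A single text-line image (hypothetical path).
    image = Image.open("textline.png").convert("RGB")

    # Resize/normalize the image and split it into patches for the encoder.
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # Autoregressively decode wordpiece tokens, then detokenize to a string.
    generated_ids = model.generate(pixel_values)
    print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])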

2. Related Works
Shi, Bai, and Yao (2016) proposed the standard CRNN: a CNN extracts visual features and converts them into a sequence, an RNN predicts the per-frame labels, and CTC decoding removes the repeated symbols. [Figure: Shi et al. (2016)]
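To make the CTC step concrete, here is a minimal greedy CTC decoding sketch (not from the slides): collapse consecutive repeated labels, then drop the blank symbol.

    def ctc_greedy_decode(frame_labels, blank=0):
        """Collapse repeats, then remove blanks: [1, 1, 0, 1, 2, 2, 0] -> [1, 1, 2]."""
        decoded, prev = [], None
        for label in frame_labels:
            if label != prev and label != blank:   # keep only new, non-blank labels
                decoded.append(label)
            prev = label
        return decoded

    # Per-frame argmax labels from the RNN (hypothetical): "hh-e-ll-lo" -> "hello"
    print(ctc_greedy_decode([8, 8, 0, 5, 0, 12, 12, 0, 12, 15]))   # [8, 5, 12, 12, 15]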

2. Related Works
Seq2Seq models (Zhang et al. 2020b; Wang et al. 2019; Sheng, Chen, and Xu 2019; Bleeker and de Rijke 2019; Lee et al. 2020; Atienza 2021; Chen et al. 2021) are attracting more attention, especially after the arrival of the Transformer architecture. [Figure: Wang et al. (2019)]

2. Related Works
Rectification was introduced to handle the perspective distortion that causes text to appear in irregular shapes (Shi et al. 2016; Baek et al. 2019; Litman et al. 2020; Shi et al. 2018; Zhan and Lu 2019). [Figure: Shi et al. (2016)]

3. Model Architecture
Encoder
- Receives an input image and decomposes it into a batch of fixed-size patches, same as ViT and DeiT.
- The "[CLS]" token is kept for image classification.
Decoder
- The encoder output provides the keys and values; the decoder input provides the queries.
- Hidden states from the decoder are projected by a linear layer (a minimal wiring sketch follows).
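A minimal sketch of this encoder-decoder wiring in plain PyTorch; the dimensions, head counts, and layer counts are illustrative assumptions, not the paper's exact configuration (positional embeddings and the "[CLS]" token are omitted for brevity):

    import torch
    import torch.nn as nn

    d_model, vocab_size = 768, 50265                                  # illustrative sizes
    patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)    # 16x16 patches -> tokens
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True), num_layers=12)
    decoder = nn.TransformerDecoder(
        nn.TransformerDecoderLayer(d_model, nhead=12, batch_first=True), num_layers=12)
    lm_head = nn.Linear(d_model, vocab_size)   # projects decoder hidden states to token logits

    image = torch.randn(1, 3, 384, 384)                       # one text-line image
    patches = patch_embed(image).flatten(2).transpose(1, 2)   # (1, 576, d_model)
    memory = encoder(patches)                                 # keys/values for the decoder

    tgt = torch.randn(1, 10, d_model)    # embedded, shifted target tokens (the queries)
    hidden = decoder(tgt, memory)        # cross-attention over the encoder output
    logits = lm_head(hidden)             # (1, 10, vocab_size)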

3. Model Architecture
[Figure: TrOCR model architecture overview]

3. Model Architecture
Model Initialization
- DeiT and BEiT models are used for encoder initialization.
- RoBERTa and MiniLM are used to initialize the decoder.
- The decoder structures do not precisely match, since RoBERTa and MiniLM are not the decoder half of an encoder-decoder Transformer; the decoders are therefore initialized by manually setting the corresponding parameter mapping (an illustrative sketch follows).
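A hedged sketch of what such a manual parameter mapping might look like; the key names below are hypothetical, not the authors' actual checkpoint layout:

    def map_roberta_to_decoder(roberta_state, decoder_state):
        """Copy RoBERTa tensors into a decoder state dict where the roles correspond."""
        rename = {
            # hypothetical source -> destination key pairs
            "roberta.embeddings.word_embeddings.weight": "decoder.embed_tokens.weight",
            "roberta.encoder.layer.0.attention.self.query.weight":
                "decoder.layers.0.self_attn.q_proj.weight",
            # ... one entry per corresponding tensor
        }
        for src, dst in rename.items():
            if src in roberta_state and dst in decoder_state:
                decoder_state[dst] = roberta_state[src].clone()
        # Layers with no counterpart (e.g., encoder-decoder attention) keep their random init.
        return decoder_state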

3. Model Architecture
Task Pipeline
- Given a textline image, the model extracts the visual features and predicts the WordPiece tokens.
- The sequence of ground-truth tokens is followed by an "[EOS]" token.
- During training, the sequence is shifted backward by one place and a "[BOS]" token is prepended to mark the start of generation.
- Cross-entropy loss supervises the decoder output on the shifted ground truth against the original ground truth (see the sketch below).
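A small sketch of the shifting and cross-entropy supervision described above (token ids and special-token values are hypothetical):

    import torch
    import torch.nn.functional as F

    BOS, PAD, EOS = 0, 1, 2                      # hypothetical special-token ids
    vocab_size = 1000

    # Ground truth: wordpiece ids of the textline, terminated by [EOS].
    target = torch.tensor([[37, 512, 88, 4, EOS]])

    # Decoder input: prepend [BOS] and shift the ground truth back by one place.
    decoder_input = torch.cat([torch.full((1, 1), BOS), target[:, :-1]], dim=1)

    # In TrOCR the logits come from decoding the image features; a random placeholder here.
    logits = torch.randn(1, target.size(1), vocab_size)

    # Cross-entropy between the prediction at each position and the original ground truth.
    loss = F.cross_entropy(logits.view(-1, vocab_size), target.view(-1), ignore_index=PAD)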

3. Model Architecture
Pre-training (on synthesized large-scale data, in two stages)
- First stage: hundreds of millions of printed textline images.
- Second stage: two smaller task-specific datasets, (1) printed and (2) handwritten, with millions of textline images each; the models are initialized from the first-stage model.
- The output of the TrOCR models is based on Byte Pair Encoding (BPE) and SentencePiece.

3. Model Architecture
Data Augmentation
Six image transformations plus keeping the original are applied to the printed and handwritten datasets, each chosen with equal probability per sample (see the sketch below):
- Random rotation (-10 to 10 degrees)
- Gaussian blurring
- Image dilation
- Image erosion
- Downscaling
- Underlining
Scene text datasets are augmented with the RandAugment technique: inversion, curving, noise, distortion, rotation, etc.
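A minimal sketch of the equal-probability choice among the seven options (keep the original plus the six transformations) using Pillow; the parameter values are assumptions, not the paper's settings:

    import random
    from PIL import Image, ImageDraw, ImageFilter

    def augment(img: Image.Image) -> Image.Image:
        """Pick one of seven options (original + six transforms) uniformly at random."""
        choice = random.randrange(7)
        if choice == 0:
            return img                                            # keep the original
        if choice == 1:
            return img.rotate(random.uniform(-10, 10), expand=True, fillcolor="white")
        if choice == 2:
            return img.filter(ImageFilter.GaussianBlur(radius=1.5))
        if choice == 3:
            return img.filter(ImageFilter.MaxFilter(3))           # dilation (grows bright regions)
        if choice == 4:
            return img.filter(ImageFilter.MinFilter(3))           # erosion (shrinks bright regions)
        if choice == 5:
            w, h = img.size
            return img.resize((w // 2, h // 2)).resize((w, h))    # downscaling
        out = img.copy()                                          # choice == 6: underlining
        ImageDraw.Draw(out).line([(0, out.height - 3), (out.width - 1, out.height - 3)],
                                 fill="black", width=2)
        return out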

4. Experiments and Evaluations
Data
- 684M textlines from 2M pages of PDF files from the Internet.
- 17.9M handwritten textlines synthesized by TRDG with 5.43K handwriting fonts + the IIIT-HWS dataset.
- 53K receipt images + 1M textlines generated by TRDG with 2 receipt fonts.
- MJSynth (MJ) and SynthText (ST): 16M text images for scene text.
Benchmarks
- Scanned Receipts OCR and Information Extraction (SROIE): 626 training + 361 testing images.
- IAM Handwriting Database: 6.16K lines/747 forms, 966 lines/155 forms, and 2.92K lines/336 forms for training, testing, and validation respectively.
- IIIT5K-3000, SVT-647, IC13-857, IC13-1015, IC15-1811, IC15-2077, SVTP-645, and CT80-288 scene text datasets are used to evaluate the capabilities of TrOCR.

4. Experiments and Evaluations
Settings
- 32 V100 GPUs (32 GB memory each) for pre-training; 8 V100 GPUs for fine-tuning.
- Batch size of 2,048; learning rate 5e-5.
- BPE and SentencePiece tokenizers are used to tokenize the textlines into wordpieces.
- Images are fixed to 384x384 resolution with a 16x16 patch size for the DeiT and BEiT encoders (see the patch-count check below).
[Table: encoder and decoder variants]
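To connect the resolution and patch size: with 384x384 inputs and 16x16 patches, the encoder receives 24 x 24 = 576 patch tokens per image (plus special tokens such as "[CLS]"). A quick check:

    image_size, patch_size = 384, 16
    patches_per_side = image_size // patch_size   # 384 / 16 = 24
    num_patches = patches_per_side ** 2           # 24 * 24 = 576 patch tokens
    print(patches_per_side, num_patches)          # 24 576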

4. Experiments and Evaluations
Results
[Result tables comparing TrOCR variants with prior methods on the benchmarks listed above]

5. Conclusion
- An end-to-end Transformer-based OCR model for text recognition with pre-trained models.
- TrOCR does not rely on CNN models for image understanding.
- Experimental results show that TrOCR achieves state-of-the-art results.

References
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. arXiv preprint. https://doi.org/10.48550/arxiv.1706.03762
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint. https://doi.org/10.48550/arxiv.2010.11929
- Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2020). Training Data-Efficient Image Transformers & Distillation Through Attention. arXiv preprint. https://doi.org/10.48550/arxiv.2012.12877
- Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint. https://doi.org/10.48550/arxiv.1907.11692
- Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., & Zhou, M. (2020). MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. arXiv preprint. https://doi.org/10.48550/arxiv.2002.10957

Thank You!