[NS][Lab_Seminar_240624]Graph Neural Networks for End-to-End Information Extraction from Handwritten Documents.pptx
About This Presentation
Graph Neural Networks for End-to-End Information Extraction from Handwritten Documents
Size: 1.49 MB
Language: en
Added: Jun 24, 2024
Slides: 12 pages
Slide Content
Graph Neural Networks for End-to-End Information Extraction from Handwritten Documents
Tien-Bach-Thanh Do, Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea
E-mail: osfa19730@catholic.ac.kr
2024/06/24
Paper: Yessine Khanfir et al., WACV 2024
Introduction
- Paper documents exist in many forms and hold valuable information; automated Information Extraction systems are needed to process them
- Information extraction approaches from document images:
  - Two-stage: transform the document image into a textual representation, then apply NLP to parse the output text and extract the named entity tags
  - End-to-end: simultaneous recognition of text and Named Entity annotation, or direct identification of NEs at the image level without requiring a recognition step
- Main contributions:
  - Sparse Graph Transformer Encoder (SGTE): a variant of the Graph Transformer that encodes input sequences of visual features; graph structures flexibly define a dynamic scope of attention that changes with the position of the indexed feature vector, reducing the computational cost
  - Cross-GCN-based Decoder: reinforces the alignment of visual features to characters and NE tags; cross graphs combine the outputs of the last SGTE layer and the Multi-Head Attention block to improve representation learning
Overview
Figure 1. Overview of the proposed architecture: The backbone produces a stack of feature maps from the input image, which is used to construct the encoder's initial graph (1); the initialization of the graph nodes and edges is illustrated in Figure 2. The graph is then fed to the SGTE to output an updated representation of the feature vectors (2). The ground-truth sequence goes into the MMHA block (3). The updated representations of the visual features are stacked back in sequential form while preserving their initial order (4), then fed as input to the MHA block along with the output of the MMHA (5). The output of the MHA (N2) and the output of the last encoder layer (N1) are used to construct the decoder's initial graph (6), which is fed to the Cross-GCN to reinforce the alignment of visual features to characters and NEs (7). Both signals emerging from the GCN and the MHA are combined with a direct sum, then passed to a Softmax layer to perform the predictions.
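The numbered steps in the caption can be read as a single forward pass. Below is a minimal, hypothetical PyTorch-style sketch of that data flow; the module names (backbone, sgte, mmha, mha, cross_gcn, classifier) and tensor shapes are placeholders inferred from the caption, not the authors' code.

```python
import torch

def forward_pass(image, target_seq, backbone, sgte, mmha, mha, cross_gcn, classifier):
    """Hypothetical outline of the Figure 1 data flow; all modules are placeholders."""
    # (1) Backbone -> stack of 256 feature maps (8 x 32), reshaped into graph nodes
    feats = backbone(image)                       # (B, 256, 8, 32)
    nodes = feats.flatten(2).transpose(1, 2)      # (B, 8*32, 256), one node per feature vector
    # (2) Sparse Graph Transformer Encoder updates the node representations
    n1 = sgte(nodes)                              # N1, order of the feature vectors preserved (4)
    # (3) The ground-truth sequence goes through the masked multi-head attention block
    tgt = mmha(target_seq)
    # (5) Multi-head attention aligns the visual features to the target sequence
    n2 = mha(tgt, n1, n1)                         # N2
    # (6)-(7) Cross-GCN over the graph built from N1 and N2 reinforces the alignment
    gcn_out = cross_gcn(n1, n2)
    # Both signals are combined with a direct sum, then a Softmax produces the predictions
    return torch.softmax(classifier(gcn_out + n2), dim=-1)
```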
Overview
Figure 2. Minimized illustration of the graph construction method: The backbone outputs, for each image, 256 superposed feature maps, each of size 8 × 32. Feature vectors are retrieved along the third dimension of the collection of feature maps, preserving their original position in the image and their correspondence to characters. Each feature vector becomes a node representation in an undirected graph that includes self-connections.
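A minimal sketch of the graph construction just described, assuming the backbone output is a tensor of shape (256, 8, 32): each spatial position becomes one node with a 256-dimensional feature vector, and an undirected edge list with self-loops is built on top. The helper names are illustrative only.

```python
import torch

def feature_maps_to_nodes(feature_maps: torch.Tensor) -> torch.Tensor:
    """feature_maps: (256, 8, 32) stack from the backbone.
    Returns node features of shape (8*32, 256): one node per spatial position,
    so each node keeps its original position in the image and its
    correspondence to characters."""
    c, h, w = feature_maps.shape                      # 256, 8, 32
    return feature_maps.permute(1, 2, 0).reshape(h * w, c)

def with_self_loops(edges: list[tuple[int, int]], num_nodes: int) -> list[tuple[int, int]]:
    """Undirected graph that includes self-connections, as in Figure 2."""
    return edges + [(i, i) for i in range(num_nodes)]
```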
Sparse Graph Transformer Encoder
- Adopts a generalization of Transformers to graph structures [7]
- The real size of (F) is 8×32×256; it is used to create an initial graph of 8×32 nodes, each initially represented by a vector of size 256
- Uses a sparse graph that customizes the scope of attention of each element of the feature maps
- Visual features are categorized into 2 types (see the sketch below):
  - First line of (F): connected to all neighbors on the same and the next line
  - Last line of (F): connected to all neighbors on the same and the previous line
- An edge is a connection between nodes
- Node representations are updated through the SGTE layers over the attention heads
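One possible reading of the attention scope above, expressed as an edge list over the 8×32 node grid: each node attends to all nodes on its own line plus the next line (first row), the previous line (last row), or, by assumption, both adjacent lines for the rows in between. This is an illustrative sketch, not the authors' implementation.

```python
def build_sparse_edges(h: int = 8, w: int = 32) -> list[tuple[int, int]]:
    """Edge list for the SGTE's sparse attention scope (node index = row * w + col)."""
    edges = []
    for row in range(h):
        scope = {row}                    # a node always attends to its own line
        if row > 0:
            scope.add(row - 1)           # previous line (unavailable for the first line)
        if row < h - 1:
            scope.add(row + 1)           # next line (unavailable for the last line)
        for col in range(w):
            src = row * w + col
            for r in scope:
                for c in range(w):
                    edges.append((src, r * w + c))   # includes the self-loop (r == row, c == col)
    return edges
```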
Cross-GCN-based Decoder
- The MMHA block models the semantic relations between elements of the ground-truth sequence
- Its output is then fed to the MHA block, which aligns the input elements to the target sequence
- A directed graph is constructed from the nodes of the last SGTE layer's output (N1) and the nodes of the MHA block's output (N2)
- To maintain the masking effect of the MMHA block, N2 nodes are not connected to each other; they are only linked to all nodes in N1
- The decoder's initial graph is fed to a 2-layer GCN (Cross-GCN); each node from the MHA is represented by a weighted sum over the N1 nodes. The representation update follows the standard GCN form $h_i^{(l+1)} = \sigma\big(\sum_{j \in \mathcal{N}(i)} c_{ij}\, W^{(l)} h_j^{(l)}\big)$, where $\sigma$ is the non-linearity, $c_{ij}$ the weight coefficient (from node-degree regularization) multiplied with the initial representation $h_j^{(l)}$, and $W^{(l)}$ the weight matrix
- The Cross-GCN output is then combined with the output of the MHA by summation, for a robust alignment of visual features to characters and NE tags (see the sketch below)
- Loss function: token-level cross-entropy $\mathcal{L}(\hat{y}, y) = -\sum_{t=1}^{T}\sum_{c=1}^{V} y_{t,c}\,\log p_{t,c}$, where $\hat{y}$ is the predicted sequence, $y$ the target sequence, $T$ the sequence length, $V$ the number of unique words in the vocabulary, and $p_{t,c}$ the predicted probability
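A minimal sketch of the pieces just described: a degree-normalized GCN layer (non-linearity, weight matrix, node-degree regularization), the direct sum with the MHA output, and a token-level cross-entropy loss. The dense adjacency matrix, shapes, and names are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """One GCN layer: h' = sigma(D^-1 A h W) -- weighted sum over neighbor
    representations h, weight matrix W, node-degree regularization D^-1,
    non-linearity sigma."""
    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Linear(dim, dim, bias=False)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)   # node degrees
        return F.relu(adj @ self.weight(h) / deg)

def cross_gcn_decode(n1, n2, adj, gcn1, gcn2):
    """2-layer Cross-GCN over the decoder graph (N1 + N2 nodes), then a direct
    sum of the GCN signal for the N2 nodes with the MHA output (n2)."""
    h = torch.cat([n1, n2], dim=0)        # decoder's initial graph nodes
    h = gcn2(gcn1(h, adj), adj)
    return h[n1.size(0):] + n2            # combine both signals by summation

# Token-level cross-entropy: L = -sum_t sum_c y[t, c] * log p[t, c]
def sequence_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    return F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
```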
Experiments: Datasets
Esposalles
- 125 handwritten pages, 1221 marriage records
- Each record is composed of several text lines giving information on the husband, the wife, their parents' names, occupations, locations, and civil states
- 872 records for training, 96 for validation, 253 for test
IAM
- 13353 text lines and 115320 words spread across 1539 scanned text pages
- 6161, 966 and 2915 lines for train, validation and test
- 6 categories: Location, Time, Cardinal, Nationalities, Person, Organization
Experiments: Configuration and metrics
- ResNet-50 as backbone and a single attention head for the encoding part
- The decoder uses one attention head and 2 GCN layers
- Adam optimizer with LR = 0.0001
- Metric: F1
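A minimal sketch of the stated training configuration (ResNet-50 backbone, single-head attention in the encoder, one attention head and 2 GCN layers in the decoder, Adam with a learning rate of 0.0001); the assembled modules are placeholders, not the authors' code.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50(weights=None)                                   # ResNet-50 feature extractor
encoder_attn = nn.MultiheadAttention(embed_dim=256, num_heads=1)    # single head in the encoder
decoder_attn = nn.MultiheadAttention(embed_dim=256, num_heads=1)    # one head in the decoder
# the decoder additionally stacks 2 GCN layers (see the Cross-GCN sketch above)

params = (list(backbone.parameters())
          + list(encoder_attn.parameters())
          + list(decoder_attn.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4)                       # Adam, LR = 0.0001
```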
Results: Ablation study
Conclusion
- Introduces an end-to-end encoder-decoder architecture, taking part in the contentious debate comparing multi-stage versus single-stage NER from handwritten documents
- 2 main contributions:
  - The combination of 2 powerful representation-learning components, GNNs and the attention mechanism, leads to an improved adaptation of textual to visual information
  - A sparse version of the Graph Transformer as encoder avoids unnecessary attention computations