Introduction
Accurate ECG interpretation from scanned strips is vital in resource-limited settings. Prior deep learning models often ignored hierarchical relationships among ECG abnormalities, limiting performance. We hypothesized that combining bimodal co-attention (across ECG images and continuous wavelet transform (CWT) maps) with a Graph Neural Network (GNN) to refine label relationships would enhance explainable multilabel classification and promote semantic consistency.
Methods
ECG voltages from the public PTB-XL dataset were plotted to mimic paper strips. Images were pre-processed with the Hough transform for gridline removal and adaptive thresholding before binarization, from which CWT maps were generated. The model was trained on 17,144 images and tested on 2,241 images. Parallel CNN stems and Vision Transformer (ViT) encoders extracted features from both modalities. These streams were fused using a bimodal co-attention operation to encourage cross-learning and consistency. A two-layer GNN refined label embeddings based on known superclass-subclass relationships, enforcing semantic hierarchy, as seen in Figure 1. Final subclass predictions were softly gated by superclass probabilities. A weighted focal loss, image augmentation, and oversampling addressed class imbalance. Training was performed on a single RTX 4060 GPU.
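The GNN label-refinement step described above can be sketched as a GCN-style propagation over the label graph. The toy hierarchy below (one superclass with two subclasses), the embedding dimension, and the random weights are illustrative placeholders, not the trained model's parameters:

```python
import numpy as np

# Hypothetical toy hierarchy: label 0 (a superclass, e.g. MI) is the parent
# of labels 1 and 2 (e.g. AMI, PMI). The real model's graph covers the full
# PTB-XL superclass-subclass structure.
EDGES = [(0, 1), (0, 2)]          # superclass -> subclass edges
N_LABELS, DIM = 3, 4

def normalized_adjacency(edges, n):
    """Symmetric adjacency with self-loops, degree-normalized (GCN-style)."""
    a = np.eye(n)
    for u, v in edges:
        a[u, v] = a[v, u] = 1.0
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a.sum(axis=1)))
    return d_inv_sqrt @ a @ d_inv_sqrt

def gnn_refine(label_emb, a_hat, w1, w2):
    """Two rounds of neighbor aggregation + linear transform + ReLU,
    mirroring a two-layer GNN over label embeddings."""
    h = np.maximum(a_hat @ label_emb @ w1, 0.0)
    return np.maximum(a_hat @ h @ w2, 0.0)

rng = np.random.default_rng(0)
emb = rng.normal(size=(N_LABELS, DIM))      # initial label embeddings
w1 = rng.normal(size=(DIM, DIM))
w2 = rng.normal(size=(DIM, DIM))
refined = gnn_refine(emb, normalized_adjacency(EDGES, N_LABELS), w1, w2)
print(refined.shape)  # (3, 4): one refined embedding per label
```

Because each subclass embedding mixes in its parent's embedding, related labels end up closer in embedding space, which is what enforces the semantic hierarchy.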
Results
The model achieved a macro F1 of 0.68, an AUC of 0.92, and 88% overall accuracy. It performed well across all stages: normal vs. abnormal (F1 = 0.82), superclass (F1 = 0.71), and subclass (F1 = 0.65) (Figure 2). The bimodal co-attention improved recognition of rare diagnoses, while GNN-based label refinement enhanced overall semantic consistency and explainability across predictions. This work represents the first image-based ECG model to achieve state-of-the-art metrics across 17 diagnostic classes in the PTB-XL dataset.
Conclusions
Combining a ViT architecture and bimodal co-attention with GNN-refined label embeddings significantly advances hierarchically explainable ECG diagnosis from image-based inputs, supporting broader adoption in resource-constrained environments.
Slide Content
Slot No. 13 WaveFormer: Bimodal Co-Attention and Graph-Augmented Hierarchical ECG Strip Classification Understanding AI-based ECG diagnosis from paper-like images using rhythm–morphology fusion and clinical label structure (PTB-XL dataset)
Disclosure of Financial Interests No relevant financial relationships or conflicts of interest.
Aims and Objectives ECGs are often printed or scanned in resource-limited settings. Digital waveform capture is not always available. Accurate interpretation remains challenging for AI models. Existing models ignore class hierarchy and ECG rhythm/morphology interplay. Goal: Build a clinically consistent, AI-powered diagnostic tool.
Related Work Prior ECG AI Approaches: 1D signal models (e.g., CNNs, LSTMs, ResNet) operate on waveform data (Chen et al., IEEE JBHI, 2022). Image-based models convert signals into spectrograms or stylized ECGs, but rarely use paper-like ECG images. Most models use flat multilabel heads or target small class subsets (e.g., 5–10 labels) (Siontis et al., Nat Rev Cardiol, 2023). Common Limitations: Class hierarchy (e.g., MI → AMI/PMI) is ignored in label design and model logic. Many models assume single-label outputs, not multilabel cardiac scenarios. Few attempts exist to train on 17 PTB-XL classes, especially using realistic ECG image formats. Our Contributions: First to classify all 17 PTB-XL diagnostic classes (5 superclasses, 11 subclasses, and NORM). Uses both ECG image (morphology) and CWT (rhythm) features with bidirectional co-attention. Enforces diagnostic consistency via a Graph Neural Network (GNN) encoding class relationships. Enables soft multilabel prediction, supporting realistic clinical co-diagnoses.
Methodology: Dataset: PTB-XL Overview 17,441 ECG signals for training, 2,291 for testing. 26 total diagnostic labels: 5 superclasses, 21 subclasses. Multilabel format – more than one diagnosis per ECG is possible.
ECG Preprocessing Pipeline Raw voltages plotted to mimic real ECG printouts. Gridlines removed using Hough Transform. Waveforms enhanced with adaptive thresholding and binarized. Each lead is isolated using bounding boxes. CWT (wavelet) maps created from final image using Morlet wavelets.
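The final step of the pipeline above, turning an extracted lead into a Morlet scalogram, can be sketched in plain numpy. Real pipelines would typically use a wavelet library (and OpenCV's Hough routines for the gridline step); the wavelet normalization, scale values, and toy input here are illustrative assumptions:

```python
import numpy as np

def morlet(t, w=5.0):
    """Real part of a Morlet wavelet: a cosine under a Gaussian envelope."""
    return np.cos(w * t) * np.exp(-t**2 / 2.0)

def cwt_map(signal, scales):
    """Minimal continuous wavelet transform: convolve the signal with a
    Morlet wavelet stretched to each scale. One scalogram row per scale."""
    out = np.empty((len(scales), len(signal)))
    for i, s in enumerate(scales):
        t = np.arange(-4 * s, 4 * s + 1)      # support of ~8 scale-widths
        kernel = morlet(t / s) / np.sqrt(s)   # stretch + energy normalize
        out[i] = np.convolve(signal, kernel, mode="same")
    return out

# Toy "extracted lead": a 5 Hz tone sampled at 250 Hz for 2 seconds.
fs = 250
t = np.arange(0, 2, 1 / fs)
lead = np.sin(2 * np.pi * 5 * t)
scalogram = cwt_map(lead, scales=[4, 8, 16])
print(scalogram.shape)  # (3, 500): scales x time
```

The scalogram rows capture rhythm content at different temporal scales, which is the "rhythm" input that complements the morphology carried by the raw ECG image.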
ECG Pre-Processing
AI Model: WaveFormer Two inputs: ECG image + corresponding CWT map. Separate CNN blocks extract features from each input. Both streams are fed into transformer models (ViT). Co-attention matches rhythm and morphology info. GNN-based label embeddings maintain clinical hierarchy.
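The bidirectional co-attention step can be sketched as two scaled dot-product attentions, one in each direction, with residual fusion. Token counts, dimensions, and the concatenation scheme below are illustrative assumptions, not the exact WaveFormer fusion layer:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values, scale):
    """Scaled dot-product attention: queries from one modality attend
    over the tokens of the other modality."""
    attn = softmax(queries @ keys_values.T / scale)
    return attn @ keys_values

def co_attention(img_tokens, cwt_tokens):
    """Bidirectional co-attention: each stream is enriched with context
    from the other (residual add), then the streams are stacked."""
    scale = np.sqrt(img_tokens.shape[-1])
    img_ctx = cross_attend(img_tokens, cwt_tokens, scale)  # image -> CWT
    cwt_ctx = cross_attend(cwt_tokens, img_tokens, scale)  # CWT -> image
    return np.concatenate([img_tokens + img_ctx,
                           cwt_tokens + cwt_ctx], axis=0)

rng = np.random.default_rng(1)
img_tokens = rng.normal(size=(16, 32))   # 16 ViT patch tokens, dim 32
cwt_tokens = rng.normal(size=(16, 32))   # 16 CWT tokens, dim 32
fused = co_attention(img_tokens, cwt_tokens)
print(fused.shape)  # (32, 32): both enriched streams stacked
```

The point of the bidirectional exchange is that morphology tokens can borrow rhythm evidence and vice versa before classification.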
WaveFormer
Stepwise Diagnosis Prediction Stage 1: Normal vs Abnormal – if Normal, stop here. Stage 2: Detect major abnormality type (superclass) if sigmoid logits > 0.2. Stage 3: Predict specific disease (subclass) only if the relevant superclass is active. Soft thresholds allow multiple diagnoses if likely. Final subclass prediction = sigmoid(subclass) × sigmoid(superclass) > 0.13. Example: An ECG may show both conduction disorder and ischemia.
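The three-stage gating above can be sketched directly. The thresholds (0.2 and 0.13) come from the slide; the tiny subclass-to-superclass map and the logit values are hypothetical stand-ins for the real 17-class hierarchy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parent map: subclasses 0 and 1 belong to superclass 0,
# subclass 2 to superclass 1 (the real model maps 11 subclasses).
SUB_TO_SUPER = {0: 0, 1: 0, 2: 1}

def staged_predict(p_normal, super_logits, sub_logits,
                   t_super=0.2, t_final=0.13):
    """Stage 1: stop at Normal. Stage 2: activate superclasses whose
    sigmoid exceeds t_super. Stage 3: a subclass fires only when
    sigmoid(sub) * sigmoid(parent super) clears t_final."""
    if p_normal > 0.5:
        return ["NORM"]
    p_super = sigmoid(super_logits)
    active = {i for i, p in enumerate(p_super) if p > t_super}
    preds = []
    for j, p in enumerate(sigmoid(sub_logits)):
        parent = SUB_TO_SUPER[j]
        if parent in active and p * p_super[parent] > t_final:
            preds.append(j)
    return preds

preds = staged_predict(0.1, np.array([2.0, -3.0]),
                       np.array([1.0, -2.0, 3.0]))
print(preds)  # [0]: only subclass 0 clears both gates
```

Subclass 2 is suppressed despite a strong logit because its parent superclass is inactive; this is how the gating keeps predictions semantically consistent with the hierarchy.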
Training Summary & Results Batch size: 128. Optimizer: Adam, learning rate 0.0001, weight decay 0.001. Cosine annealing learning rate scheduler. Loss: Weighted Binary Focal Cross Entropy. Weighted F1 score: 0.84 | Macro AUC: 0.92 | Weighted Accuracy: 88%. Strong performance across rare and common conditions.
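The weighted binary focal loss can be sketched as below. The focusing exponent `gamma=2.0` and uniform class weights are illustrative defaults; the slide does not specify the exact values used:

```python
import numpy as np

def weighted_binary_focal_loss(logits, targets, class_weights,
                               gamma=2.0, eps=1e-7):
    """Focal variant of binary cross-entropy for multilabel targets:
    easy examples (p_t near 1) are down-weighted by (1 - p_t)^gamma,
    and each label carries a class weight to counter imbalance."""
    p = 1.0 / (1.0 + np.exp(-logits))
    p_t = np.where(targets == 1, p, 1.0 - p)       # prob of the true label
    p_t = np.clip(p_t, eps, 1.0 - eps)
    loss = -class_weights * (1.0 - p_t) ** gamma * np.log(p_t)
    return loss.mean()

logits = np.array([[4.0, -4.0]])                   # confident predictions
weights = np.ones(2)                               # uniform (illustrative)
correct = weighted_binary_focal_loss(logits, np.array([[1.0, 0.0]]), weights)
wrong = weighted_binary_focal_loss(logits, np.array([[0.0, 1.0]]), weights)
print(correct < wrong)  # True: confident misses dominate the loss
```

Because confident correct predictions are down-weighted by `(1 - p_t)^gamma`, gradient signal concentrates on hard and rare examples, which is why focal losses pair well with the oversampling used here.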
ECG Superclass Performance
ECG All Class Performance
Implications and Conclusion AI can interpret paper-like ECG images directly. Multilabel and hierarchy-aware predictions mirror how doctors think. Model can handle overlapping diagnoses robustly. Potential to support diagnostics in low-resource or digitization-limited areas. Designed for real-world use, not just ideal datasets.