Teach Multimodal LLMs to Comprehend Electrocardiographic Images

¹The Ohio State University, ²Carnegie Mellon University
*Equal Contribution

Abstract

The electrocardiogram (ECG) is an essential non-invasive diagnostic tool for assessing cardiac conditions. Existing automatic interpretation methods suffer from limited generalizability, focusing on a narrow range of cardiac conditions, and typically depend on raw physiological signals, which may not be readily available in resource-limited settings where only printed or digital ECG images are accessible. Recent advancements in multimodal large language models (MLLMs) present promising opportunities for addressing these challenges. However, the application of MLLMs to ECG image interpretation remains challenging due to the lack of instruction tuning datasets and well-established ECG image benchmarks for quantitative evaluation. To address these challenges, we introduce ECGInstruct, a comprehensive ECG image instruction tuning dataset of over one million samples, covering a wide range of ECG-related tasks from diverse data sources. Using ECGInstruct, we develop PULSE, an MLLM tailored for ECG image comprehension. In addition, we curate ECGBench, a new evaluation benchmark covering four key ECG image interpretation tasks across nine different datasets. Our experiments show that PULSE sets a new state-of-the-art, outperforming general MLLMs with an average accuracy improvement of 15% to 30%. This work highlights the potential of PULSE to enhance ECG interpretation in clinical practice.


Figure 1. Overview of PULSE results. PULSE demonstrates superior performance across multiple in-domain and out-of-domain datasets in our constructed ECGBench, compared with advanced proprietary MLLMs (e.g., GPT-4o). Notably, the proprietary MLLMs often fail to accurately interpret ECG images, generating responses that are well-structured and contextually relevant but ultimately incorrect relative to the ground-truth diagnosis (errors highlighted in red).

Dataset: ECGInstruct

Figure 2. ECGInstruct: a diverse, large-scale instruction tuning dataset for ECG image interpretation. (1) ECG images are synthesized from raw signal recordings with various distortions that mimic real-world printed ECGs. (2) Instructions are curated from clinician-defined ECG-related tasks, original diagnoses and clinical reports, and diverse task types. An additional quality check filters out low-scored instructions.
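
For intuition, below is a minimal sketch of how a printed-style ECG image might be rendered from raw signals. This is not the paper's actual synthesis pipeline: the lead layout, grid styling, and noise model are illustrative assumptions, and the real pipeline applies richer distortions.

```python
# Minimal sketch: render a 12-lead ECG image from raw signals.
# Assumptions (not the paper's exact pipeline): signals are a (12, N) NumPy
# array sampled at 500 Hz; printing artifacts are approximated by a simple
# pink grid background and additive noise.
import numpy as np
import matplotlib.pyplot as plt

LEADS = ["I", "II", "III", "aVR", "aVL", "aVF",
         "V1", "V2", "V3", "V4", "V5", "V6"]

def render_ecg_image(signals: np.ndarray, fs: int = 500,
                     out_path: str = "ecg.png") -> None:
    """Plot 12 leads in a 3x4 layout on a grid mimicking ECG paper."""
    t = np.arange(signals.shape[1]) / fs
    fig, axes = plt.subplots(3, 4, figsize=(16, 9), sharex=True)
    for ax, lead, sig in zip(axes.ravel(), LEADS, signals):
        noisy = sig + np.random.normal(0.0, 0.02, sig.shape)  # print noise
        ax.plot(t, noisy, color="black", linewidth=0.8)
        ax.set_title(lead, loc="left", fontsize=9)
        ax.grid(True, color="lightpink", linewidth=0.5)  # ECG-paper grid
        ax.tick_params(labelbottom=False, labelleft=False, length=0)
    fig.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)

# Example: 10 seconds of synthetic signals in place of a real recording.
render_ecg_image(np.random.randn(12, 5000) * 0.1)
```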


Summary of ECGInstruct

| Source Dataset | Task | Type | # Samples |
|---|---|---|---|
| PTB-XL | Feature | Close/Open/Fill/MCQ | 30K |
| PTB-XL | Rhythm | Close/Open/Fill/MCQ | 36K |
| PTB-XL | Morphology | Close/Open/Fill/MCQ | 67K |
| PTB-XL | Report | Open | 16K |
| ECG-QA | Feature | Close | 40K |
| ECG-QA | Rhythm | Close | 9K |
| ECG-QA | Morphology | Close | 90K |
| MIMIC-IV-ECG | Feature | Close/Open/Fill/MCQ | 29K |
| MIMIC-IV-ECG | Rhythm | Close/Open/Fill/MCQ | 115K |
| MIMIC-IV-ECG | Morphology | Close/Open/Fill/MCQ | 169K |
| MIMIC-IV-ECG | Report | Open | 487K |
| CODE-15% | Feature | Close | 22K |
| CODE-15% | Rhythm | Close | 14K |
| CODE-15% | Morphology | Close | 31K |
| **Total** | | | **1.2M** |
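
To make the table concrete, below is a hypothetical ECGInstruct-style sample in a LLaVA-style conversation format. The field names and structure are assumptions for illustration, not the released schema.

```python
# Hypothetical ECGInstruct-style sample (field names are illustrative
# assumptions, not the released schema): a close-ended rhythm question
# paired with a synthesized ECG image.
sample = {
    "id": "mimic-iv-ecg-000001",
    "image": "images/mimic-iv-ecg-000001.png",
    "task": "Rhythm",
    "type": "Close",
    "conversations": [
        {"from": "human",
         "value": "<image>\nDoes this ECG show atrial fibrillation?"},
        {"from": "gpt",
         "value": "No. The rhythm is sinus with regular R-R intervals."},
    ],
}
```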

Benchmark: ECGBench

Figure 3. The data curation process for ECGBench. Four key tasks are involved: (1) two repurposed tasks (abnormality detection and report generation) derived from existing ECG datasets, where ECG images are synthesized from raw signals and queries/answers are extracted from diagnostic and clinical reports; (2) two newly developed tasks built on external resources, where ECG images and their associated questions and answers are collected and generated from real-world sources.


Summary of ECGBench

| Evaluation Dataset | Task | Type | # Samples | In-Domain? |
|---|---|---|---|---|
| PTB-XL Super | Abnormality Detection | Close-ended | 2,082 | YES |
| PTB-XL Report | Report Generation | Open-ended | 500 | YES |
| CODE-15% | Abnormality Detection | Close-ended | 1,400 | YES |
| ECG-QA | Abnormality Detection | Close-ended | 1,317 | YES |
| CPSC 2018 | Abnormality Detection | Close-ended | 2,061 | NO |
| CSN | Abnormality Detection | MCQ (8-option) | 1,611 | NO |
| G12EC | Abnormality Detection | MCQ (8-option) | 2,026 | NO |
| MMMU ECG | Multimodal Understanding | MCQ (4-option) | 200 | NO |
| ECG Arena | Multi-turn Conversation | Open-ended | 50 | NO |
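
As an illustration of the MCQ-style tasks, below is a hypothetical 8-option prompt of the kind used for CSN and G12EC. The wording and option set are assumptions, not the benchmark's exact template.

```python
# Hypothetical 8-option MCQ prompt for abnormality detection (illustrative
# wording and options; not the exact ECGBench template).
MCQ_PROMPT = """<image>
Which of the following best describes the abnormality in this ECG?
(A) Atrial fibrillation       (B) Sinus bradycardia
(C) Left bundle branch block  (D) Right bundle branch block
(E) First-degree AV block     (F) Premature ventricular contraction
(G) ST-segment elevation      (H) Normal ECG
Answer with the option letter only."""
```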

Results

| Method | PTB-XL Super AUC | F1 | HL | PTB-XL Report Score | CODE-15% AUC | F1 | HL | ECG-QA Acc. |
|---|---|---|---|---|---|---|---|---|
| Random | 50.3 | 33.2 | 50.1 | 0 | 48.8 | 15.0 | 32.1 | 16.2 |
| **Domain-specific Methods** | | | | | | | | |
| METS | - | 65.7† | - | N/A | - | - | - | N/A |
| MERL | 74.2† | - | - | N/A | - | - | - | N/A |
| ST-MEM | 71.4† | - | - | N/A | - | - | - | N/A |
| ECG-GPT | 69.5* | 53.9* | 20.1* | 47.8* | 68.9* | 40.1* | 17.4* | N/A |
| **Proprietary MLLMs** | | | | | | | | |
| GPT-4o | 55.6 | 28.3 | 26.2 | 50.2 | 59.9 | 24.9 | 15.7 | 35.2 |
| GPT-4o mini | 52.0 | 20.4 | 31.7 | 37.1 | 57.5 | 22.0 | 15.1 | 14.9 |
| Gemini 1.5 Pro | 50.7 | 15.3 | 27.9 | 35.9 | 56.7 | 20.0 | 15.9 | 33.2 |
| Claude 3.5 Sonnet | 54.0 | 27.5 | 29.6 | 43.7 | 58.3 | 20.3 | 17.8 | 34.2 |
| **Open-source MLLMs** | | | | | | | | |
| LLaVA-Med | 50.0 | 12.3 | 28.1 | 24.3 | 69.2 | 27.0 | 33.4 | 29.5 |
| LLaVA-1.5-7B | 50.0 | 12.3 | 28.1 | 27.2 | 63.9 | 19.2 | 25.3 | 25.2 |
| LLaVA-1.5-13B | 50.0 | 35.2 | 48.4 | 20.7 | 53.9 | 13.1 | 13.6 | 21.2 |
| LLaVA-1.6-Vicuna-7B | 50.0 | 15.8 | 29.4 | 16.5 | 50.1 | 1.0 | 13.6 | 13.3 |
| LLaVA-1.6-Vicuna-13B | 50.0 | 20.1 | 38.3 | 5.9 | 53.0 | 3.6 | 16.6 | 22.0 |
| LLaVA-1.6-34B | 50.2 | 19.9 | 36.0 | 17.0 | 57.2 | 12.8 | 16.6 | 22.4 |
| LLaVA-OneVision-7B | 49.8 | 11.4 | 34.5 | 30.0 | 58.7 | 17.0 | 20.6 | 20.4 |
| LLaVA-OneVision-72B | 50.6 | 29.6 | 50.4 | 40.6 | 52.3 | 7.0 | 13.1 | 25.0 |
| Deepseek-VL-Chat-7B | 50.9 | 15.7 | 27.9 | 15.6 | 63.7 | 27.5 | 22.4 | 21.1 |
| Idefics2-8B | 50.7 | 21.9 | 31.2 | 10.6 | 49.0 | 17.9 | 47.9 | 26.1 |
| Mantis-8B-siglip-Llama3 | 50.6 | 20.4 | 30.0 | 16.0 | 57.5 | 17.9 | 15.7 | 23.8 |
| MiniCPM-V-2.6 | 49.0 | 37.7 | 63.8 | 15.4 | 56.6 | 25.3 | 22.0 | 20.8 |
| Phi-3-Vision-128k-Instruct | 50.0 | 29.6 | 48.4 | 20.2 | 69.6 | 22.6 | 38.8 | 28.4 |
| Qwen2-VL-7B | 51.3 | 22.4 | 30.8 | 43.0 | 60.7 | 24.8 | 20.5 | 20.4 |
| Qwen2-VL-72B | 54.0 | 28.3 | 30.2 | 48.9 | 60.6 | 23.6 | 16.1 | 23.7 |
| InternVL2-8B | 50.6 | 14.3 | 27.8 | 38.1 | 55.8 | 16.1 | 17.7 | 22.3 |
| InternVL2-40B | 51.2 | 18.7 | 34.6 | 41.8 | 56.7 | 16.2 | 17.4 | 18.2 |
| InternVL2-Llama3-76B | 50.4 | 9.4 | 35.6 | 41.4 | 59.0 | 20.2 | 20.5 | 21.8 |
| PULSE-7B (Ours) | 82.4 | 74.8 | 11.0 | 61.3 | 90.7 | 85.4 | 5.0 | 73.8 |
| Δ over best proprietary MLLM | +27 | +47 | +15 | +11 | +30 | +61 | +10 | +39 |
| Δ over best open-source MLLM | +28 | +37 | +17 | +12 | +21 | +58 | +8 | +44 |
In-domain evaluation results. † indicates results taken from the original papers, * denotes results obtained using the provided online software, N/A indicates methods not applicable or not designed for the task, and - indicates scores unreported in the original papers. HL denotes Hamming Loss (lower is better). Note that the experimental setup of some domain-specific methods differs from ours, so their results are listed for reference only.
| Method | CPSC 2018 AUC | F1 | HL | CSN Acc. | G12EC Acc. | MMMU ECG Acc. | ECG Arena Score |
|---|---|---|---|---|---|---|---|
| Random | 51.2 | 15.1 | 28.8 | 11.6 | 12.1 | 24.2 | 0 |
| **Domain-specific Methods** | | | | | | | |
| METS | - | - | - | N/A | N/A | N/A | N/A |
| MERL | 82.8† | - | - | N/A | N/A | N/A | N/A |
| ST-MEM | 70.4† | - | - | N/A | N/A | N/A | N/A |
| ECG-GPT | 69.3* | 44.0* | 9.9* | N/A | N/A | N/A | N/A |
| **Proprietary MLLMs** | | | | | | | |
| GPT-4o | 50.9 | 10.6 | 18.2 | 57.5 | 49.2 | 43.5 | 33.5 |
| GPT-4o mini | 49.2 | 11.0 | 25.5 | 32.1 | 33.2 | 39.5 | 30.1 |
| Gemini 1.5 Pro | 50.1 | 7.4 | 20.5 | 50.5 | 36.0 | 40.0 | 31.2 |
| Claude 3.5 Sonnet | 52.8 | 11.5 | 18.9 | 51.5 | 51.4 | 42.0 | 37.1 |
| **Open-source MLLMs** | | | | | | | |
| LLaVA-Med | 50.0 | 2.5 | 20.2 | 13.8 | 14.1 | 27.0 | 15.9 |
| LLaVA-1.5-7B | 50.0 | 2.5 | 20.0 | 32.1 | 25.4 | 33.0 | 12.7 |
| LLaVA-1.5-13B | 50.4 | 13.3 | 30.1 | 30.7 | 30.7 | 35.0 | 13.1 |
| LLaVA-1.6-Vicuna-7B | 50.5 | 19.7 | 66.0 | 23.7 | 23.3 | 28.0 | 16.0 |
| LLaVA-1.6-Vicuna-13B | 50.0 | 19.3 | 62.8 | 31.4 | 35.0 | 38.0 | 17.9 |
| LLaVA-1.6-34B | 49.6 | 19.3 | 62.8 | 44.3 | 45.9 | 31.0 | 17.5 |
| LLaVA-OneVision-7B | 49.6 | 8.0 | 28.3 | 23.3 | 25.7 | 26.0 | 22.5 |
| LLaVA-OneVision-72B | 51.5 | 12.8 | 29.4 | 44.0 | 42.6 | 35.0 | 15.5 |
| Deepseek-VL-Chat-7B | 50.7 | 6.0 | 20.0 | 35.7 | 32.9 | 34.5 | 15.3 |
| Idefics2-8B | 49.0 | 17.9 | 47.9 | 22.8 | 26.2 | 36.0 | 4.9 |
| Mantis-8B-siglip-Llama3 | 51.3 | 19.1 | 48.5 | 17.6 | 22.6 | 38.5 | 13.6 |
| MiniCPM-V-2.6 | 50.0 | 18.0 | 48.4 | 12.7 | 19.6 | 34.5 | 20.4 |
| Phi-3-Vision-128k-Instruct | 50.6 | 19.0 | 70.2 | 14.8 | 18.4 | 31.0 | 11.3 |
| Qwen2-VL-7B | 49.4 | 17.5 | 46.3 | 25.5 | 32.9 | 31.5 | 8.5 |
| Qwen2-VL-72B | 50.7 | 9.8 | 18.9 | 35.5 | 42.9 | 35.0 | 10.3 |
| InternVL2-8B | 52.1 | 8.2 | 22.2 | 47.7 | 37.5 | 30.0 | 22.9 |
| InternVL2-40B | 52.4 | 8.2 | 21.4 | 41.0 | 45.0 | 30.5 | 28.0 |
| InternVL2-Llama3-76B | 51.3 | 6.5 | 20.4 | 26.6 | 34.7 | 38.0 | 22.5 |
| PULSE-7B (Ours) | 76.9 | 57.6 | 8.6 | 85.2 | 78.2 | 58.0 | 38.9 |
| Δ over best proprietary MLLM | +24 | +46 | +10 | +28 | +27 | +15 | +2 |
| Δ over best open-source MLLM | +25 | +38 | +10 | +38 | +33 | +20 | +11 |
Out-of-domain evaluation results. Notation follows the in-domain table: † indicates results taken from the original papers, * denotes results obtained using the provided online software, N/A indicates methods not applicable or not designed for the task, and - indicates scores unreported in the original papers.
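
For reference, the close-ended abnormality detection metrics above (AUC, F1, and Hamming Loss) can be computed for multi-label predictions as in the sketch below. It assumes thresholded probability outputs and uses scikit-learn; this is an illustration, not the paper's exact evaluation code.

```python
# Illustrative computation of the close-ended metrics (AUC, macro F1,
# Hamming Loss) for multi-label abnormality detection with scikit-learn.
# Assumption: y_true / y_prob are (num_samples, num_classes) arrays.
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, hamming_loss

y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])            # gold labels
y_prob = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.4], [0.6, 0.9, 0.3]])
y_pred = (y_prob >= 0.5).astype(int)                            # thresholded

print("AUC (macro):", roc_auc_score(y_true, y_prob, average="macro"))
print("F1 (macro): ", f1_score(y_true, y_pred, average="macro"))
print("Hamming Loss:", hamming_loss(y_true, y_pred))            # lower is better
```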

BibTeX


      @article{liu2024teach,
        title={Teach Multimodal LLMs to Comprehend Electrocardiographic Images},
        author={Ruoqi Liu and Yuelin Bai and Xiang Yue and Ping Zhang},
        journal={arXiv preprint arXiv:2410.19008},
        year={2024}
      }