Teach Multimodal LLMs to Comprehend Electrocardiographic Images

¹The Ohio State University, ²Carnegie Mellon University
*Equal Contribution

Abstract

The electrocardiogram (ECG) is an essential non-invasive diagnostic tool for assessing cardiac conditions. Existing automatic interpretation methods suffer from limited generalizability, focusing on a narrow range of cardiac conditions, and typically depend on raw physiological signals, which may not be readily available in resource-limited settings where only printed or digital ECG images are accessible. Recent advancements in multimodal large language models (MLLMs) present promising opportunities for addressing these challenges. However, the application of MLLMs to ECG image interpretation remains challenging due to the lack of instruction tuning datasets and well-established ECG image benchmarks for quantitative evaluation. To address these challenges, we introduce ECGInstruct, a comprehensive ECG image instruction tuning dataset of over one million samples, covering a wide range of ECG-related tasks from diverse data sources. Using ECGInstruct, we develop PULSE, an MLLM tailored for ECG image comprehension. In addition, we curate ECGBench, a new evaluation benchmark covering four key ECG image interpretation tasks across nine different datasets. Our experiments show that PULSE sets a new state-of-the-art, outperforming general MLLMs with an average accuracy improvement of 15% to 30%. This work highlights the potential of PULSE to enhance ECG interpretation in clinical practice.


Figure 1. Overview of PULSE results. PULSE demonstrates superior performance across multiple in-domain and out-of-domain datasets in our constructed ECGBench, compared with advanced proprietary MLLMs (e.g., GPT-4o). Notably, the proprietary MLLMs often fail to accurately interpret ECG images, generating responses that are well-structured and contextually relevant but ultimately incorrect relative to the ground-truth diagnosis (errors highlighted in red).

Dataset: ECGInstruct

Figure 2. ECGInstruct: a diverse, large-scale instruction tuning dataset for ECG image interpretation. (1) ECG images are synthesized from raw signal recordings with various distortions that mimic real-world printed ECGs. (2) Instructions are curated from clinician-defined ECG-related tasks, original diagnoses and clinical reports, and diverse task types. An additional quality check filters out low-scored instructions.
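
For intuition, below is a minimal sketch of how a printed-style ECG image might be rendered from raw signals. This is not the paper's actual synthesis pipeline: the lead layout, grid styling, and noise model are illustrative assumptions, and the real pipeline applies richer distortions.

```python
# Minimal sketch: render a 12-lead ECG image from raw signals.
# Assumptions (not the paper's exact pipeline): signals are a (12, N) NumPy
# array sampled at 500 Hz; printing artifacts are approximated by a simple
# pink grid background and additive noise.
import numpy as np
import matplotlib.pyplot as plt

LEADS = ["I", "II", "III", "aVR", "aVL", "aVF",
         "V1", "V2", "V3", "V4", "V5", "V6"]

def render_ecg_image(signals: np.ndarray, fs: int = 500,
                     out_path: str = "ecg.png") -> None:
    """Plot 12 leads in a 3x4 layout on a grid mimicking ECG paper."""
    t = np.arange(signals.shape[1]) / fs
    fig, axes = plt.subplots(3, 4, figsize=(16, 9), sharex=True)
    for ax, lead, sig in zip(axes.ravel(), LEADS, signals):
        noisy = sig + np.random.normal(0.0, 0.02, sig.shape)  # print noise
        ax.plot(t, noisy, color="black", linewidth=0.8)
        ax.set_title(lead, loc="left", fontsize=9)
        ax.grid(True, color="lightpink", linewidth=0.5)  # ECG-paper grid
        ax.tick_params(labelbottom=False, labelleft=False, length=0)
    fig.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)

# Example: 10 seconds of synthetic signals in place of a real recording.
render_ecg_image(np.random.randn(12, 5000) * 0.1)
```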


Summary of ECGInstruct

| Source Dataset | Task | Type | # Samples |
|---|---|---|---|
| PTB-XL | Feature | Close/Open/Fill/MCQ | 30K |
| PTB-XL | Rhythm | Close/Open/Fill/MCQ | 36K |
| PTB-XL | Morphology | Close/Open/Fill/MCQ | 67K |
| PTB-XL | Report | Open | 16K |
| ECG-QA | Feature | Close | 40K |
| ECG-QA | Rhythm | Close | 9K |
| ECG-QA | Morphology | Close | 90K |
| MIMIC-IV-ECG | Feature | Close/Open/Fill/MCQ | 29K |
| MIMIC-IV-ECG | Rhythm | Close/Open/Fill/MCQ | 115K |
| MIMIC-IV-ECG | Morphology | Close/Open/Fill/MCQ | 169K |
| MIMIC-IV-ECG | Report | Open | 487K |
| CODE-15% | Feature | Close | 22K |
| CODE-15% | Rhythm | Close | 14K |
| CODE-15% | Morphology | Close | 31K |
| **Total** | | | **1.2M** |
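
To make the table concrete, below is a hypothetical ECGInstruct-style sample in a LLaVA-style conversation format. The field names and structure are assumptions for illustration, not the released schema.

```python
# Hypothetical ECGInstruct-style sample (field names are illustrative
# assumptions, not the released schema): a close-ended rhythm question
# paired with a synthesized ECG image.
sample = {
    "id": "mimic-iv-ecg-000001",
    "image": "images/mimic-iv-ecg-000001.png",
    "task": "Rhythm",
    "type": "Close",
    "conversations": [
        {"from": "human",
         "value": "<image>\nDoes this ECG show atrial fibrillation?"},
        {"from": "gpt",
         "value": "No. The rhythm is sinus with regular R-R intervals."},
    ],
}
```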

Benchmark: ECGBench

Figure 3. The data curation process for ECGBench. Four key tasks are involved: (1) two repurposed tasks (abnormality detection and report generation) derived from existing ECG datasets, where ECG images are synthesized from raw signals and queries/answers are extracted from diagnostic and clinical reports; (2) two newly developed tasks built on external resources, where ECG images and their associated questions and answers are collected and generated from real-world sources.


Summary of ECGBench

| Evaluation Dataset | Task | Type | # Samples | In-Domain? |
|---|---|---|---|---|
| PTB-XL Super | Abnormality Detection | Close-ended | 2,082 | YES |
| PTB-XL Report | Report Generation | Open-ended | 500 | YES |
| CODE-15% | Abnormality Detection | Close-ended | 1,400 | YES |
| ECG-QA | Abnormality Detection | Close-ended | 1,317 | YES |
| CPSC 2018 | Abnormality Detection | Close-ended | 2,061 | NO |
| CSN | Abnormality Detection | MCQ (8-option) | 1,611 | NO |
| G12EC | Abnormality Detection | MCQ (8-option) | 2,026 | NO |
| MMMU ECG | Multimodal Understanding | MCQ (4-option) | 200 | NO |
| ECG Arena | Multi-turn Conversation | Open-ended | 50 | NO |
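
As an illustration of the MCQ-style tasks, below is a hypothetical 8-option prompt of the kind used for CSN and G12EC. The wording and option set are assumptions, not the benchmark's exact template.

```python
# Hypothetical 8-option MCQ prompt for abnormality detection (illustrative
# wording and options; not the exact ECGBench template).
MCQ_PROMPT = """<image>
Which of the following best describes the abnormality in this ECG?
(A) Atrial fibrillation       (B) Sinus bradycardia
(C) Left bundle branch block  (D) Right bundle branch block
(E) First-degree AV block     (F) Premature ventricular contraction
(G) ST-segment elevation      (H) Normal ECG
Answer with the option letter only."""
```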

Results

| Method | PTB-XL Super AUC | F1 | HL | PTB-XL Report Score | CODE-15% AUC | F1 | HL | ECG-QA Acc. |
|---|---|---|---|---|---|---|---|---|
| Random | 50.3 | 33.2 | 50.1 | 0 | 48.8 | 15.0 | 32.1 | 16.2 |
| **Domain-specific Methods** | | | | | | | | |
| METS | - | 65.7† | - | N/A | - | - | - | N/A |
| MERL | 74.2† | - | - | N/A | - | - | - | N/A |
| ST-MEM | 71.4† | - | - | N/A | - | - | - | N/A |
| ECG-GPT | 69.5* | 53.9* | 20.1* | 47.8* | 68.9* | 40.1* | 17.4* | N/A |
| **Proprietary MLLMs** | | | | | | | | |
| GPT-4o | 55.6 | 28.3 | 26.2 | 50.2 | 59.9 | 24.9 | 15.7 | 35.2 |
| GPT-4o mini | 52.0 | 20.4 | 31.7 | 37.1 | 57.5 | 22.0 | 15.1 | 14.9 |
| Gemini 1.5 Pro | 50.7 | 15.3 | 27.9 | 35.9 | 56.7 | 20.0 | 15.9 | 33.2 |
| Claude 3.5 Sonnet | 54.0 | 27.5 | 29.6 | 43.7 | 58.3 | 20.3 | 17.8 | 34.2 |
| **Open-source MLLMs** | | | | | | | | |
| LLaVA-Med | 50.0 | 12.3 | 28.1 | 24.3 | 69.2 | 27.0 | 33.4 | 29.5 |
| LLaVA-1.5-7B | 50.0 | 12.3 | 28.1 | 27.2 | 63.9 | 19.2 | 25.3 | 25.2 |
| LLaVA-1.5-13B | 50.0 | 35.2 | 48.4 | 20.7 | 53.9 | 13.1 | 13.6 | 21.2 |
| LLaVA-1.6-Vicuna-7B | 50.0 | 15.8 | 29.4 | 16.5 | 50.1 | 1.0 | 13.6 | 13.3 |
| LLaVA-1.6-Vicuna-13B | 50.0 | 20.1 | 38.3 | 5.9 | 53.0 | 3.6 | 16.6 | 22.0 |
| LLaVA-1.6-34B | 50.2 | 19.9 | 36.0 | 17.0 | 57.2 | 12.8 | 16.6 | 22.4 |
| LLaVA-OneVision-7B | 49.8 | 11.4 | 34.5 | 30.0 | 58.7 | 17.0 | 20.6 | 20.4 |
| LLaVA-OneVision-72B | 50.6 | 29.6 | 50.4 | 40.6 | 52.3 | 7.0 | 13.1 | 25.0 |
| Deepseek-VL-Chat-7B | 50.9 | 15.7 | 27.9 | 15.6 | 63.7 | 27.5 | 22.4 | 21.1 |
| Idefics2-8B | 50.7 | 21.9 | 31.2 | 10.6 | 49.0 | 17.9 | 47.9 | 26.1 |
| Mantis-8B-siglip-Llama3 | 50.6 | 20.4 | 30.0 | 16.0 | 57.5 | 17.9 | 15.7 | 23.8 |
| MiniCPM-V-2.6 | 49.0 | 37.7 | 63.8 | 15.4 | 56.6 | 25.3 | 22.0 | 20.8 |
| Phi-3-Vision-128k-Instruct | 50.0 | 29.6 | 48.4 | 20.2 | 69.6 | 22.6 | 38.8 | 28.4 |
| Qwen2-VL-7B | 51.3 | 22.4 | 30.8 | 43.0 | 60.7 | 24.8 | 20.5 | 20.4 |
| Qwen2-VL-72B | 54.0 | 28.3 | 30.2 | 48.9 | 60.6 | 23.6 | 16.1 | 23.7 |
| InternVL2-8B | 50.6 | 14.3 | 27.8 | 38.1 | 55.8 | 16.1 | 17.7 | 22.3 |
| InternVL2-40B | 51.2 | 18.7 | 34.6 | 41.8 | 56.7 | 16.2 | 17.4 | 18.2 |
| InternVL2-Llama3-76B | 50.4 | 9.4 | 35.6 | 41.4 | 59.0 | 20.2 | 20.5 | 21.8 |
| PULSE-7B (Ours) | 82.4 | 74.8 | 11.0 | 61.3 | 90.7 | 85.4 | 5.0 | 73.8 |
| Δ over best proprietary MLLM | +27 | +47 | +15 | +11 | +30 | +61 | +10 | +39 |
| Δ over best open-source MLLM | +28 | +37 | +17 | +12 | +21 | +58 | +8 | +44 |
In-domain evaluation results. † indicates results taken from the original papers, * denotes results obtained using the provided online software, N/A indicates methods not applicable or not designed for the task, and - indicates scores unreported in the original papers. HL denotes Hamming Loss (lower is better). Note that the experimental setup of some domain-specific methods differs from ours, so their results are listed for reference only.
| Method | CPSC 2018 AUC | F1 | HL | CSN Acc. | G12EC Acc. | MMMU ECG Acc. | ECG Arena Score |
|---|---|---|---|---|---|---|---|
| Random | 51.2 | 15.1 | 28.8 | 11.6 | 12.1 | 24.2 | 0 |
| **Domain-specific Methods** | | | | | | | |
| METS | - | - | - | N/A | N/A | N/A | N/A |
| MERL | 82.8† | - | - | N/A | N/A | N/A | N/A |
| ST-MEM | 70.4† | - | - | N/A | N/A | N/A | N/A |
| ECG-GPT | 69.3* | 44.0* | 9.9* | N/A | N/A | N/A | N/A |
| **Proprietary MLLMs** | | | | | | | |
| GPT-4o | 50.9 | 10.6 | 18.2 | 57.5 | 49.2 | 43.5 | 33.5 |
| GPT-4o mini | 49.2 | 11.0 | 25.5 | 32.1 | 33.2 | 39.5 | 30.1 |
| Gemini 1.5 Pro | 50.1 | 7.4 | 20.5 | 50.5 | 36.0 | 40.0 | 31.2 |
| Claude 3.5 Sonnet | 52.8 | 11.5 | 18.9 | 51.5 | 51.4 | 42.0 | 37.1 |
| **Open-source MLLMs** | | | | | | | |
| LLaVA-Med | 50.0 | 2.5 | 20.2 | 13.8 | 14.1 | 27.0 | 15.9 |
| LLaVA-1.5-7B | 50.0 | 2.5 | 20.0 | 32.1 | 25.4 | 33.0 | 12.7 |
| LLaVA-1.5-13B | 50.4 | 13.3 | 30.1 | 30.7 | 30.7 | 35.0 | 13.1 |
| LLaVA-1.6-Vicuna-7B | 50.5 | 19.7 | 66.0 | 23.7 | 23.3 | 28.0 | 16.0 |
| LLaVA-1.6-Vicuna-13B | 50.0 | 19.3 | 62.8 | 31.4 | 35.0 | 38.0 | 17.9 |
| LLaVA-1.6-34B | 49.6 | 19.3 | 62.8 | 44.3 | 45.9 | 31.0 | 17.5 |
| LLaVA-OneVision-7B | 49.6 | 8.0 | 28.3 | 23.3 | 25.7 | 26.0 | 22.5 |
| LLaVA-OneVision-72B | 51.5 | 12.8 | 29.4 | 44.0 | 42.6 | 35.0 | 15.5 |
| Deepseek-VL-Chat-7B | 50.7 | 6.0 | 20.0 | 35.7 | 32.9 | 34.5 | 15.3 |
| Idefics2-8B | 49.0 | 17.9 | 47.9 | 22.8 | 26.2 | 36.0 | 4.9 |
| Mantis-8B-siglip-Llama3 | 51.3 | 19.1 | 48.5 | 17.6 | 22.6 | 38.5 | 13.6 |
| MiniCPM-V-2.6 | 50.0 | 18.0 | 48.4 | 12.7 | 19.6 | 34.5 | 20.4 |
| Phi-3-Vision-128k-Instruct | 50.6 | 19.0 | 70.2 | 14.8 | 18.4 | 31.0 | 11.3 |
| Qwen2-VL-7B | 49.4 | 17.5 | 46.3 | 25.5 | 32.9 | 31.5 | 8.5 |
| Qwen2-VL-72B | 50.7 | 9.8 | 18.9 | 35.5 | 42.9 | 35.0 | 10.3 |
| InternVL2-8B | 52.1 | 8.2 | 22.2 | 47.7 | 37.5 | 30.0 | 22.9 |
| InternVL2-40B | 52.4 | 8.2 | 21.4 | 41.0 | 45.0 | 30.5 | 28.0 |
| InternVL2-Llama3-76B | 51.3 | 6.5 | 20.4 | 26.6 | 34.7 | 38.0 | 22.5 |
| PULSE-7B (Ours) | 76.9 | 57.6 | 8.6 | 85.2 | 78.2 | 58.0 | 38.9 |
| Δ over best proprietary MLLM | +24 | +46 | +10 | +28 | +27 | +15 | +2 |
| Δ over best open-source MLLM | +25 | +38 | +10 | +38 | +33 | +20 | +11 |
Out-of-domain evaluation results. Notation follows the in-domain table: † indicates results taken from the original papers, * denotes results obtained using the provided online software, N/A indicates methods not applicable or not designed for the task, and - indicates scores unreported in the original papers.
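
For reference, the close-ended abnormality detection metrics above (AUC, F1, and Hamming Loss) can be computed for multi-label predictions as in the sketch below. It assumes thresholded probability outputs and uses scikit-learn; this is an illustration, not the paper's exact evaluation code.

```python
# Illustrative computation of the close-ended metrics (AUC, macro F1,
# Hamming Loss) for multi-label abnormality detection with scikit-learn.
# Assumption: y_true / y_prob are (num_samples, num_classes) arrays.
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, hamming_loss

y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])            # gold labels
y_prob = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.4], [0.6, 0.9, 0.3]])
y_pred = (y_prob >= 0.5).astype(int)                            # thresholded

print("AUC (macro):", roc_auc_score(y_true, y_prob, average="macro"))
print("F1 (macro): ", f1_score(y_true, y_pred, average="macro"))
print("Hamming Loss:", hamming_loss(y_true, y_pred))            # lower is better
```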

BibTeX


      @article{liu2024teach,
        title={Teach Multimodal LLMs to Comprehend Electrocardiographic Images},
        author={Ruoqi Liu and Yuelin Bai and Xiang Yue and Ping Zhang},
        journal={arXiv preprint arXiv:2410.19008},
        year={2024}
      }