Multimodal Dementia Prediction with LLMs: Cross-Attention over Text, Audio, and Image

Felix Agbavor; Hualou Liang

doi:10.1145/3768322.3769098

Conference proceeding

Open access

Companion Proceedings of the 16th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pp 1-1

11 Oct 2025

DOI: https://doi.org/10.1145/3768322.3769098

Published, Version of Record (VoR) Open

Abstract

Computing methodologies -- Artificial intelligence -- Computer vision -- Computer vision tasks -- Biometrics

Computing methodologies -- Artificial intelligence -- Natural language processing -- Information extraction

Information systems -- Information retrieval -- Retrieval tasks and goals -- Sentiment analysis

Information systems -- Information retrieval -- Specialized information retrieval -- Multimedia and multimodal retrieval

Security and privacy -- Security services

We introduce a large language model (LLM)-powered multimodal cross-attention framework (Figure 1) that aligns lexical (transcript), acoustic (speech), and visual (stimulus image) information for dementia detection on the ADReSSo 2021 picture-description corpus [4], using the Cookie Theft scene as the visual stream. Building on our previous work showing that LLM text embeddings can predict dementia from spontaneous speech [1], we advance from unimodal text to multimodal integration across text, audio, and image. In this framework, text is encoded with ModernBERT [6] into 768-dimensional embeddings; audio is resampled to 16 kHz, encoded with self-supervised speech encoders (Wav2Vec2-base-960h and Data2Vec) [2, 3], and reduced from variable-length frame sequences to 768-dimensional vectors via mean or max pooling; and the Cookie Theft image is embedded with CLIP ViT-L/14 (768-dimensional) [5]. We first run a controlled benchmark over encoder choice, pooling strategy, fusion graph, and downstream classifier family. We then compare two-modality and three-modality cross-attention graphs against gating and attentive-gating variants, and finally report results with both linear and kernel classifiers so that our conclusions are not tied to a single backend. Three-way cross-attention achieves the strongest overall performance and the most stable behavior. Our best setup, which jointly cross-attends text + audio with text + image, reaches Accuracy = 0.8732 and F1 = 0.8656 on held-out splits, outperforming alternative three-way pairings while remaining consistent across classifier families. Ablation studies show that neither gating nor attentive gating outperforms cross-attention in any setting. The choice of acoustic representation, however, is critical: Wav2Vec2 consistently improves lexical-acoustic alignment relative to Data2Vec.
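The pooling step mentioned above, which turns a variable-length sequence of encoder frames into one fixed-size vector, can be illustrated with a minimal sketch. This is not the authors' code: the function name `pool_frames` and the toy 4-dimensional frames are assumptions chosen for readability (the encoders named in the abstract emit 768-dimensional frames), and the sketch uses only the Python standard library.

```python
from typing import List, Sequence


def pool_frames(frames: Sequence[Sequence[float]], mode: str = "mean") -> List[float]:
    """Collapse a (num_frames x dim) sequence into a single dim-length vector.

    Hypothetical helper for illustration; `mode` is either "mean" or "max".
    """
    if not frames:
        raise ValueError("need at least one frame")
    # Transpose so that each tuple holds one feature dimension across all frames.
    columns = list(zip(*frames))
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode!r}")


# Toy input: 3 frames of 4 features each (an assumption; real frames are 768-d).
frames = [
    [1.0, 0.0, -2.0, 4.0],
    [3.0, 0.0, 2.0, 4.0],
    [2.0, 3.0, 0.0, 1.0],
]
mean_vec = pool_frames(frames, "mean")
max_vec = pool_frames(frames, "max")
print(mean_vec)
print(max_vec)
```

Either mode yields a vector whose length equals the frame dimension regardless of how many frames the recording produced, which is what lets recordings of different durations share one downstream classifier.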
A colorization stress test on the Cookie Theft stimulus confirms the robustness of the vision stream: modest perturbations of the CLIP embeddings have negligible impact on downstream accuracy or F1. Taken together, these results support an LLM-driven framework for multimodal dementia biomarkers: ModernBERT (text), Wav2Vec2 (audio), and CLIP (image) fused through multimodal cross-attention, with the downstream classifier kept flexible (e.g., linear or kernel) to match deployment needs. By focusing on the fusion mechanism, this study positions three-way cross-attention as a practical default for integrating lexical, acoustic, and visual signals in dementia modeling on ADReSSo 2021 [4].
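To make the fusion idea concrete, the following is a minimal sketch of cross-attention in which text acts as the query stream and attends separately over audio and over image features, mirroring the text + audio with text + image pairing described above. Several things here are assumptions rather than details from the abstract: the single attention head, the absence of learned query/key/value projections, the toy 2-dimensional features, and the final step of mean pooling and concatenating the two attended streams. The helper names `softmax`, `cross_attend`, and `mean_pool` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: List[Vector], keys: List[Vector], values: List[Vector]) -> List[Vector]:
    """Scaled dot-product cross-attention: each query mixes the values of another modality."""
    scale = math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of value vectors, one output dimension at a time.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return attended


def mean_pool(rows: List[Vector]) -> Vector:
    """Average a list of vectors into one vector."""
    return [sum(col) / len(col) for col in zip(*rows)]


# Toy features (assumptions): 2 text tokens, 3 audio frames, 1 image embedding.
text = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image = [[0.5, 0.5]]

text_audio = cross_attend(text, audio, audio)  # text queries attend over audio
text_image = cross_attend(text, image, image)  # text queries attend over image
# Assumed fusion step: pool each attended stream, then concatenate.
fused = mean_pool(text_audio) + mean_pool(text_image)
print(fused)
```

The fused vector would then be passed to whichever downstream classifier is in use. With a single image embedding as the only key, the attention weight is necessarily 1, so the text-image branch in this toy setup simply returns the image vector; a patch-level image representation would be needed for that branch to weight regions differently.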

Details

Title: Multimodal Dementia Prediction with LLMs: Cross-Attention over Text, Audio, and Image
Creators: Felix Agbavor (Drexel University, Philadelphia, PA, USA); Hualou Liang (Drexel University)
Publication Details: Companion Proceedings of the 16th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pp 1-1
Conference: BCB Companion '25: Companion Proceedings of the 16th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
Series: ACM Conferences
Publisher: ACM, New York
Number of pages: 1
Resource Type: Conference proceeding
Language: English
Academic Unit: School of Biomedical Engineering, Science, and Health Systems; [Retired Faculty]
Web of Science ID: WOS:001661442600026
Other Identifier: 991022138673304721
