Conference proceeding
Multimodal Dementia Prediction with LLMs: Cross-Attention over Text, Audio, and Image
Companion Proceedings of the 16th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pp 1-1
11 Oct 2025
Abstract
We introduce a Large Language Model (LLM)-powered multimodal cross-attention framework (Figure 1) that aligns lexical (transcripts), acoustic (speech), and visual (stimulus image) information for dementia detection on the ADReSSo 2021 picture-description corpus [4], using the Cookie Theft scene as the visual stream. Building on our previous work showing that large language model text embeddings can predict dementia from spontaneous speech [1], we advance from unimodal text to multimodal integration across text, audio, and image. In this framework, text is encoded using ModernBERT [6] into 768-dimensional embeddings; audio is resampled to 16 kHz, encoded with self-supervised speech encoders (Wav2Vec2-base-960h and Data2Vec) [2, 3], and reduced from variable-length frame sequences to 768-dimensional vectors via mean or max pooling; the Cookie Theft image is embedded with CLIP ViT-L/14 (768-dimensional) [5]. In our experiments, we first conduct a controlled benchmark evaluating encoder choice, pooling strategy, fusion graph, and downstream classifier family. We then compare two-modality and three-modality cross-attention graphs against gating and attentive-gating variants, and finally present results using both linear and kernel classifiers to ensure our conclusions are not tied to a single backend. Multimodal three-way cross-attention achieves the strongest overall performance and the most stable behavior. Our best setup, which jointly cross-attends text + audio with text + image, reaches Accuracy = 0.8732 and F1 = 0.8656 on held-out splits, outperforming alternative three-way pairings while maintaining consistency across classifier families. Ablation studies reveal that neither gating nor attentive-gating outperforms cross-attention in any setting. However, the choice of acoustic representation is critical: Wav2Vec2 consistently enhances lexical-acoustic alignment compared to Data2Vec.
A colorization stress test on the Cookie Theft stimulus confirms the robustness of the vision stream: modest perturbations in CLIP embeddings have negligible impact on downstream accuracy or F1. Taken together, these results highlight a powerful LLM-driven framework for multimodal dementia biomarkers: ModernBERT (text), Wav2Vec2 (audio), and CLIP (image) fused through multimodal cross-attention, with downstream classifiers kept flexible (e.g., linear or kernel) to match deployment needs. By focusing on the fusion mechanism, this study positions three-way cross-attention as a practical default for integrating lexical, acoustic, and visual signals in dementia modeling on ADReSSo 2021 [4].
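The pooling and fusion steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`pool_frames`, `cross_attention`, `fuse`), the single-head attention, the randomly initialized projection matrices (standing in for learned weights), and the choice to let the text vector attend over a two-token context of pooled audio and image vectors are all assumptions made for illustration. The abstract does not specify the actual fusion graph at this level of detail.

```python
import numpy as np

DIM = 768  # embedding width stated in the abstract for all three streams


def pool_frames(frames: np.ndarray, mode: str = "mean") -> np.ndarray:
    """Collapse a (num_frames, dim) frame sequence into one (dim,) vector."""
    if mode == "mean":
        return frames.mean(axis=0)
    if mode == "max":
        return frames.max(axis=0)
    raise ValueError(f"unknown pooling mode: {mode!r}")


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(query, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    query:   (n_q, dim) tokens of the attending modality
    context: (n_c, dim) tokens of the attended modality or modalities
    Returns the attended output (n_q, dim) and the weights (n_q, n_c).
    """
    q = query @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


def fuse(text_vec, audio_vec, image_vec, rng):
    """Hypothetical fusion: text attends over [audio, image], then concatenates."""
    d = text_vec.shape[0]
    # Random projections stand in for weights that would be learned in training.
    w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
    context = np.stack([audio_vec, image_vec])  # shape (2, d)
    attended, weights = cross_attention(text_vec[None, :], context, w_q, w_k, w_v)
    return np.concatenate([text_vec, attended[0]]), weights[0]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for encoder outputs; no real data is used here.
    audio_frames = rng.normal(size=(120, DIM))  # variable-length frame sequence
    text_vec = rng.normal(size=DIM)
    image_vec = rng.normal(size=DIM)

    audio_vec = pool_frames(audio_frames, mode="mean")
    fused, weights = fuse(text_vec, audio_vec, image_vec, rng)
    print(fused.shape)  # (1536,): text vector plus attended context
    print(weights)      # attention over [audio, image]
```

The fused vector could then be passed to a linear or kernel classifier, matching the abstract's point that the downstream backend is kept flexible. Note that if the context held only one pooled vector, the softmax weight would be exactly 1 and the attention would reduce to a linear projection, which is why this sketch stacks two modalities as the context.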
Details
- Title
- Multimodal Dementia Prediction with LLMs: Cross-Attention over Text, Audio, and Image
- Creators
- Felix Agbavor - Drexel University, Philadelphia, PA, USA
- Hualou Liang - Drexel University
- Publication Details
- Companion Proceedings of the 16th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pp 1-1
- Conference
- BCB Companion '25: Companion Proceedings of the 16th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
- Series
- ACM Conferences
- Publisher
- ACM; New York
- Number of pages
- 1
- Resource Type
- Conference proceeding
- Language
- English
- Academic Unit
- School of Biomedical Engineering, Science, and Health Systems; [Retired Faculty]
- Web of Science ID
- WOS:001661442600026
- Other Identifier
- 991022138673304721