Abstract
Background/Objectives: Alzheimer’s disease (AD) is a progressive neurodegenerative disorder that affects the daily lives of older adults, impairing their cognitive abilities as well as their speech and language communication. Early detection is crucial, as it enables timely intervention and helps improve the quality of life of those affected. While large language models (LLMs) have shown promise in detecting AD from spontaneous speech, most studies are unimodal and miss complementary signals across modalities. Methods: We present an LLM-powered multimodal cross-attention framework that integrates lexical (text), acoustic (speech), and visual (image) information for dementia detection on the ADReSSo 2021 picture-description dataset. Within this framework, text is encoded with ModernBERT, audio features are extracted with wav2vec 2.0 (base-960), and the Cookie Theft image is represented with CLIP ViT-L/14. These embeddings are linearly projected into a shared space and combined via Transformer-based cross-attention, yielding a fused vector for AD detection. Results: The trimodal model achieved the best overall performance when paired with an SVC classifier, reaching an accuracy of 0.8732 and an F1 score of 0.8571 and surpassing the top-performing unimodal and bimodal configurations. For interpretability, a sensitivity analysis of modality contributions reveals that text plays the primary role, audio provides complementary improvements, and image offers modest yet stabilizing contextual support. Conclusions: These results highlight that the choice of multimodal embedding fusion method significantly influences performance: a cross-attention block achieves an effective balance between accuracy and simplicity, producing integrated representations that align well with interpretable downstream classifiers.
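To make the fusion step concrete, the following is a minimal sketch of the projection-plus-cross-attention block described in the Methods, assuming PyTorch. The encoder output dimensions, the shared dimension, and the use of a learned query token are illustrative assumptions, not the authors’ reported configuration.

```python
# Minimal sketch of trimodal cross-attention fusion, assuming PyTorch.
# Dimensions (768 for ModernBERT-base and wav2vec 2.0 base, 1024 for the
# CLIP ViT-L/14 vision tower) and the shared dim are illustrative assumptions.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    def __init__(self, d_text=768, d_audio=768, d_image=1024,
                 d_shared=256, n_heads=4):
        super().__init__()
        # Linear projections of each modality into a shared embedding space.
        self.proj_text = nn.Linear(d_text, d_shared)
        self.proj_audio = nn.Linear(d_audio, d_shared)
        self.proj_image = nn.Linear(d_image, d_shared)
        # A learned query attends over the three modality tokens
        # (one possible choice of cross-attention query).
        self.query = nn.Parameter(torch.randn(1, 1, d_shared))
        self.attn = nn.MultiheadAttention(d_shared, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_shared)

    def forward(self, text_emb, audio_emb, image_emb):
        # Each input: (batch, d_modality) pooled utterance/image embedding.
        tokens = torch.stack(
            [self.proj_text(text_emb),
             self.proj_audio(audio_emb),
             self.proj_image(image_emb)],
            dim=1)                                  # (batch, 3, d_shared)
        q = self.query.expand(tokens.size(0), -1, -1)
        fused, _ = self.attn(q, tokens, tokens)     # cross-attention over modalities
        return self.norm(fused.squeeze(1))          # (batch, d_shared) fused vector


if __name__ == "__main__":
    fusion = CrossAttentionFusion()
    t, a, v = torch.randn(8, 768), torch.randn(8, 768), torch.randn(8, 1024)
    z = fusion(t, a, v)
    print(z.shape)  # torch.Size([8, 256])
```

The fused vectors can then be passed to a scikit-learn SVC, matching the downstream classifier reported in the Results; whether the attention query is a learned token, as here, or derived from the text embedding is a design choice the abstract does not specify.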