Abstract
Background/Objectives: Alzheimer’s disease (AD) is a progressive neurodegenerative disorder that affects the daily lives of older adults, impairing their cognitive abilities as well as their speech and language communication. Early detection is crucial, as it enables timely intervention and helps improve the quality of life of those affected. While large language models (LLMs) have shown promise in detecting AD from spontaneous speech, most studies are unimodal and miss complementary signals across modalities. Methods: We present an LLM-powered multimodal cross-attention framework that integrates lexical (text), acoustic (speech), and visual (image) information for dementia detection on the ADReSSo 2021 picture-description dataset. Within this framework, text is encoded with ModernBERT, audio features are extracted with wav2vec 2.0 (base-960), and the Cookie Theft image is represented with CLIP ViT-L/14. These embeddings are linearly projected into a shared space and combined via Transformer-based cross-attention, yielding a fused vector for AD detection. Results: The trimodal model achieved the best overall performance when paired with an SVC classifier, reaching an accuracy of 0.8732 and an F1 score of 0.8571 and surpassing the top-performing unimodal and bimodal configurations. For interpretability, a sensitivity analysis of modality contributions reveals that text plays the primary role, audio provides complementary improvements, and image offers modest yet stabilizing contextual support. Conclusions: These results highlight that the choice of multimodal embedding fusion method significantly influences performance: a cross-attention block achieves an effective balance between accuracy and simplicity, producing integrated representations that align well with interpretable downstream classifiers.
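To make the fusion step concrete, the following is a minimal sketch of the projection-plus-cross-attention block described in the Methods, assuming PyTorch. The encoder output dimensions, the shared dimension, and the use of a learned query token are illustrative assumptions, not the authors’ reported configuration.

```python
# Minimal sketch of trimodal cross-attention fusion, assuming PyTorch.
# Dimensions (768 for ModernBERT-base and wav2vec 2.0 base, 1024 for the
# CLIP ViT-L/14 vision tower) and the shared dim are illustrative assumptions.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    def __init__(self, d_text=768, d_audio=768, d_image=1024,
                 d_shared=256, n_heads=4):
        super().__init__()
        # Linear projections of each modality into a shared embedding space.
        self.proj_text = nn.Linear(d_text, d_shared)
        self.proj_audio = nn.Linear(d_audio, d_shared)
        self.proj_image = nn.Linear(d_image, d_shared)
        # A learned query attends over the three modality tokens
        # (one possible choice of cross-attention query).
        self.query = nn.Parameter(torch.randn(1, 1, d_shared))
        self.attn = nn.MultiheadAttention(d_shared, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_shared)

    def forward(self, text_emb, audio_emb, image_emb):
        # Each input: (batch, d_modality) pooled utterance/image embedding.
        tokens = torch.stack(
            [self.proj_text(text_emb),
             self.proj_audio(audio_emb),
             self.proj_image(image_emb)],
            dim=1)                                  # (batch, 3, d_shared)
        q = self.query.expand(tokens.size(0), -1, -1)
        fused, _ = self.attn(q, tokens, tokens)     # cross-attention over modalities
        return self.norm(fused.squeeze(1))          # (batch, d_shared) fused vector


if __name__ == "__main__":
    fusion = CrossAttentionFusion()
    t, a, v = torch.randn(8, 768), torch.randn(8, 768), torch.randn(8, 1024)
    z = fusion(t, a, v)
    print(z.shape)  # torch.Size([8, 256])
```

The fused vectors can then be passed to a scikit-learn SVC, matching the downstream classifier reported in the Results; whether the attention query is a learned token, as here, or derived from the text embedding is a design choice the abstract does not specify.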