Deepfake Speech Detection Through Emotion Recognition: A Semantic Approach

Emanuele Conti; Davide Salvi; Clara Borrelli; Brian Hosler; Paolo Bestagini; Fabio Antonacci; Augusto Sarti; Matthew C. Stamm; Stefano Tubaro

doi:10.1109/ICASSP43922.2022.9747186

Conference proceeding

Open access

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 8962-8966

23 May 2022

DOI: https://doi.org/10.1109/ICASSP43922.2022.9747186

Files and links (1)

https://re.public.polimi.it/bitstream/11311/1220518/1/2022_ICASSP_audio_deepfake_emotions%20%281%29.pdf

Submitted, Open Access (License Unspecified), Open

Abstract

Keywords: audio forensics; deep learning; deepfake; emotion recognition; semantics; signal processing algorithms; speech recognition; streaming media; transfer learning; voice activity detection

In recent years, audio and video deepfake technology has advanced relentlessly, severely affecting people's reputation and reliability. Several factors have contributed to the growing deepfake threat. On the one hand, today's hyper-connected society of social and mass media allows multimedia content to spread worldwide in real time, which facilitates the dissemination of counterfeit material. On the other hand, neural network-based techniques have made deepfakes easier to produce and harder to detect, showing that the analysis of low-level features is no longer sufficient for the task. This makes it crucial to design systems that can detect deepfakes at both the video and audio levels. In this paper, we propose a new audio spoofing detection system that leverages emotional features. The rationale behind the proposed method is that audio deepfake techniques cannot correctly synthesize natural emotional behavior. Therefore, we feed our deepfake detector with high-level features obtained from a state-of-the-art Speech Emotion Recognition (SER) system. Because these descriptors capture semantic audio information, the proposed system proves robust in cross-dataset scenarios, outperforming the considered baseline on multiple datasets.
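The general shape of such a pipeline (a frozen feature extractor, a small detector trained on its output, and an evaluation on a second dataset) can be illustrated with a toy sketch. This is not the authors' implementation: `ser_embedding` is a hypothetical placeholder for a pre-trained SER network, the logistic-regression detector is a swapped-in stand-in for whatever classifier the paper uses, and the data are synthetic vectors generated under the assumption that real and fake clips form separable clusters in embedding space.

```python
import math
import random


def ser_embedding(clip):
    """Hypothetical stand-in for a frozen, pre-trained SER network.

    A real system would map a waveform to a high-level emotion embedding.
    Here each toy "clip" is already a small feature vector, so it is
    returned unchanged.
    """
    return list(clip)


def make_toy_dataset(n, shift, rng):
    """Synthetic clips (assumption): real ones cluster near +1, fakes near -1.

    `shift` moves every feature by a constant to mimic a different dataset.
    """
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)  # 1 = deepfake, 0 = real
        center = -1.0 if label == 1 else 1.0
        clip = [center + shift + rng.gauss(0.0, 0.5) for _ in range(4)]
        data.append((clip, label))
    return data


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_detector(data, epochs=50, lr=0.1):
    """Fit a logistic-regression detector on top of the frozen embeddings."""
    feats = [ser_embedding(clip) for clip, _ in data]
    labels = [label for _, label in data]
    w = [0.0] * len(feats[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            grad = p - y  # gradient of the log-loss w.r.t. the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def accuracy(w, b, data):
    """Fraction of clips whose predicted label matches the true label."""
    correct = 0
    for clip, label in data:
        x = ser_embedding(clip)
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(pred == label)
    return correct / len(data)


rng = random.Random(0)
train_set = make_toy_dataset(200, shift=0.0, rng=rng)  # training dataset
other_set = make_toy_dataset(200, shift=0.2, rng=rng)  # unseen, shifted dataset

w, b = train_detector(train_set)
in_acc = accuracy(w, b, train_set)
cross_acc = accuracy(w, b, other_set)
print(f"in-dataset accuracy: {in_acc:.2f}, cross-dataset accuracy: {cross_acc:.2f}")
```

The detector is trained on one dataset and scored on another, which mirrors the cross-dataset setting mentioned in the abstract. The numbers it prints describe only the synthetic data and say nothing about the paper's results.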

Metrics

28 Record Views

43 citations in Web of Science

54 citations in Scopus

Details

Title: Deepfake Speech Detection Through Emotion Recognition: A Semantic Approach
Creators: Emanuele Conti - Politecnico di Milano
Davide Salvi - Politecnico di Milano
Clara Borrelli - Politecnico di Milano
Brian Hosler - Drexel University
Paolo Bestagini - Politecnico di Milano
Fabio Antonacci - Politecnico di Milano
Augusto Sarti - Politecnico di Milano
Matthew C. Stamm - Drexel University
Stefano Tubaro - Politecnico di Milano
Publication Details: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 8962-8966
Publisher: IEEE
Grant note: Air Force Research Laboratory (10.13039/100006602)
Resource Type: Conference proceeding
Language: English
Academic Unit: Electrical and Computer Engineering
Web of Science ID: WOS:000864187909055
Scopus ID: 2-s2.0-85131238453
Other Identifier: 991019173717604721

InCites Highlights

Data related to this publication, from InCites Benchmarking & Analytics tool:

Collaboration types: Domestic collaboration; International collaboration
Web of Science research areas: Acoustics; Computer Science, Artificial Intelligence; Engineering, Electrical & Electronic
