An investigation into variability conditions in the SRE 2004 and 2008 corpora

David A. Cinciruk

doi:10.17918/etd-4033

In Automatic Speaker Verification, a computer must detemine if a certain speech segment was spoken by a target speaker from whom speech had been previously provided. Speech segments are taken over many conditions such as different telephones, microphones, languages, and dialects. Differences in these conditions result in a variability that can both negatively and positively affect the performance of speaker recognition systems. While the error rates are sometimes unpredictable, the large differences between the error rates of different conditions provokes interest in ways to normalize speech segments to compensate for this variability. With a compensation technique, the error rates should decrease and become more consistent between the different conditions used to record them. The majority of research in the speaker recognition community focuses on techniques to reduce the effects of variability without analyzing what factors actually affect performance the most. To show the need for a form of variabiality compensation in speaker recognition as well as to determine the types of variability factors that most significantly influence performance, a speaker recognition system without any compensation techniques was formed and tested on the core conditions of NIST's Speaker Recognition Evaluations (SREs) 2004 and 2008. These two datasets are from a series of datasets that organizations in the speaker recognition community use most often to show performance for their speaker verification system. The false alarm and missed detection rates for individual training and target conditions were analyzed at the equal error point over each dataset. The experiments show that language plays a significant role in affecting the performance; however, dialect does not appear to have any influence at all. Consistently, English was proven to provide the best results for speaker recognition with baseline systems of the form utilized in this thesis. While there does not seem to be a single best phone and microphone for speaker recognition systems, consistent performance could be seen when the type of phone and microphone used is the same for both training and testing (matched) and when they are different (mismatched). Higher missed detection rates could be seen in mismatched conditions and higher false alarm rates could be seen in matched conditions. Interview speech was also found to have a much higher difference between false alarm and missed detection than phone speech. The thesis culminates with an in-depth of the error performance as a function of these and other various variability factors.

An investigation into variability conditions in the SRE 2004 and 2008 corpora

Files and links (1)

Abstract

Metrics

Details

An investigation into variability conditions in the SRE 2004 and 2008 corpora

Files and links (1)

Abstract

Metrics

Details

Drexel University Social media