Speaker Classification from vowel sound segments

Andrés G.D. Vargas, Johana M.L. Florez, G. Pedro Vizcaya

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

This study proposes a speaker detection method based on vowel segments and transfer learning with the VGGish and YAMNet networks. An X-vector-based system was also explored, but it did not perform well on the samples it was trained on. A compact system for isolating vowel segments from audio recordings was implemented as well. The classification system built on Parselmouth and a neural network reached an average accuracy of 89.81%. The DIMEx100 corpus served as a consistent training database, and Parselmouth proved effective for audio analysis and acoustic feature extraction. Transfer learning applied to VGGish and YAMNet adapted well to the specific task and achieved high accuracy in vowel classification. Accuracy varied by vowel, with some vowels exceeding 98% and others at around 94-95%. The results confirm that transfer learning is applicable to the classification of speakers and vowel segments, opening new lines of research on speaker identification in Spanish.
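The transfer-learning idea summarized in the abstract can be illustrated with a minimal sketch: a pretrained network is kept frozen and used only to produce embeddings, and a small classification head is trained on top of them. The abstract does not describe the authors' architecture or training setup, so everything below is an assumption made for illustration. Synthetic vectors stand in for real embeddings (128 dimensions, the size of a VGGish embedding), the head is a single softmax layer, and the function names `make_synthetic_embeddings`, `train_softmax_head`, and `predict` are hypothetical.

```python
import numpy as np


def make_synthetic_embeddings(num_speakers=5, per_speaker=40, dim=128, seed=0):
    """Placeholder data: one Gaussian cluster per speaker.

    These vectors only stand in for embeddings that a frozen pretrained
    network would produce from vowel segments; they are NOT real features.
    """
    rng = np.random.default_rng(seed)
    centers = rng.normal(size=(num_speakers, dim))
    noise = 0.3 * rng.normal(size=(num_speakers * per_speaker, dim))
    X = np.repeat(centers, per_speaker, axis=0) + noise
    y = np.repeat(np.arange(num_speakers), per_speaker)
    return X, y


def train_softmax_head(embeddings, labels, num_classes, lr=0.05, epochs=100, seed=0):
    """Fit a linear softmax layer on fixed embeddings with batch gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = embeddings.shape
    W = rng.normal(scale=0.01, size=(d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = embeddings @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (embeddings.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b


def predict(embeddings, W, b):
    """Return the most likely class index for each embedding."""
    return np.argmax(embeddings @ W + b, axis=1)


if __name__ == "__main__":
    X, y = make_synthetic_embeddings()
    # Shuffle, then hold out 25% of the samples for evaluation.
    order = np.random.default_rng(1).permutation(len(y))
    split = int(0.75 * len(y))
    train_idx, test_idx = order[:split], order[split:]
    W, b = train_softmax_head(X[train_idx], y[train_idx], num_classes=5)
    accuracy = (predict(X[test_idx], W, b) == y[test_idx]).mean()
    print(f"held-out accuracy on synthetic data: {accuracy:.2%}")
```

In a real pipeline the synthetic vectors would be replaced by embeddings extracted from vowel segments, and the reported accuracy would depend on the corpus and the network; the figure printed here says nothing about the results in the abstract.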

Original language: English
Title of host publication: AES New York 2023
Subtitle of host publication: 155th Audio Engineering Society Convention
Editors: Areti Andreopoulou, Braxton Boren
Publisher: Audio Engineering Society
ISBN (Electronic): 9781942220435
State: Published - 2023
Event: AES New York 2023: 155th Audio Engineering Society Convention - New York, United States
Duration: 25 Oct 2023 to 27 Oct 2023

Publication series

Name: AES New York 2023: 155th Audio Engineering Society Convention

Conference

Conference: AES New York 2023: 155th Audio Engineering Society Convention
Country/Territory: United States
City: New York
Period: 25/10/23 to 27/10/23
