Learning vocal tract shapes of Thai vowels from acoustical data: A preliminary study

Conference proceedings article

Authors/Editors

SANTITHAM PROM-ON

Publication Details

Author list: Prom-On S.

Publisher: Hindawi

Publication year: 2013

Volume number: 3

Start page: 2526

End page: 2530

Number of pages: 5

eISSN: 1745-4557

URL: https://www.scopus.com/inward/record.uri?eid=2-s2.0-84897077431&partnerID=40&md5=8772b9cfd877037dd9cb52cdcc459a58

Languages: English-Great Britain (EN-GB)

Abstract

This paper investigates the vocal tract shapes of Thai vowels using an analysis-by-synthesis strategy for parameter estimation with VocalTractLab, an articulatory synthesizer capable of producing a full range of speech sounds from articulatory movements defined by a sequence of vocal tract shapes and the target approximation process. Sentence stimuli were designed to highlight the contextual variation of Thai vowels by varying nine Thai long vowels (/a:/, /i:/, /u:/, /e:/, /ɛ:/, /ɯ:/, /ɤ:/, /o:/, /ɔ:/) across two syllables. For this preliminary study, speech data consisting of 81 disyllabic utterances were recorded from one native Thai speaker. Vocal tract shapes were estimated by optimizing the shape parameters of each vowel to minimize the sum of squared errors between the Mel-Frequency Cepstral Coefficients (MFCCs) of the original speech and those of speech generated by the articulatory synthesizer. A stochastic gradient descent algorithm was used to optimize the shape parameters. The parameters of all shapes were first initialized to those of the neutral vowel (schwa) and then iteratively and randomly adjusted toward a new articulatory target. Each new target position was accepted only when it resulted in a lower total error between the synthesized and original MFCC data. The optimization process was repeated until the error no longer changed significantly. The optimized vocal tract shapes could then be used to accurately synthesize Thai vowels, either as isolated syllables or within a continuous utterance, and they closely resembled the shapes of actual pronunciation. This result indicates the potential of this analysis strategy for estimating vocal tract shapes effectively and economically without actual imaging data.
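
The optimization loop described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation. It assumes a toy feature function (toy_synthesize_features) as a hypothetical stand-in for "synthesize with the articulatory synthesizer, then extract MFCCs", and it does not call VocalTractLab. The search shown is a random-perturbation search that keeps a candidate only when the sum of squared errors decreases, which is how the abstract describes the acceptance rule. All function names, the step size, and the stopping rule are assumptions made for illustration.

```python
import numpy as np


def sse(a, b):
    """Sum of squared errors between two feature arrays."""
    return float(np.sum((a - b) ** 2))


def toy_synthesize_features(params):
    """Hypothetical stand-in for synthesis followed by MFCC extraction.

    Maps a parameter vector to a small feature vector through a fixed
    nonlinear function. A real setup would run a synthesizer here.
    """
    n = params.size
    weights = np.arange(1, 4 * n + 1, dtype=float).reshape(4, n) / 10.0
    return np.tanh(weights @ params)


def estimate_shape(target_features, synthesize, n_params,
                   step=0.1, max_iters=5000, patience=500, seed=0):
    """Random-perturbation search with an accept-if-lower-error rule."""
    rng = np.random.default_rng(seed)
    params = np.zeros(n_params)  # assumed neutral starting shape
    initial_err = sse(synthesize(params), target_features)
    best_err = initial_err
    since_improved = 0
    for _ in range(max_iters):
        # Propose a randomly adjusted candidate shape.
        candidate = params + rng.normal(scale=step, size=n_params)
        err = sse(synthesize(candidate), target_features)
        if err < best_err:
            # Accept only when the total error decreases.
            params, best_err = candidate, err
            since_improved = 0
        else:
            since_improved += 1
            if since_improved >= patience:
                break  # no recent improvement: treat as converged
    return params, initial_err, best_err


# Usage: recover a shape whose features match a known target.
true_shape = np.array([0.3, -0.2, 0.5])
target = toy_synthesize_features(true_shape)
params, initial_err, final_err = estimate_shape(
    target, toy_synthesize_features, n_params=3)
print(f"error before: {initial_err:.4f}, after: {final_err:.6f}")
```

Because a candidate is kept only when it lowers the error, the final error can never exceed the initial one. The patience counter is one simple way to express "no more significant changes in errors"; the abstract does not specify the criterion actually used.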

Keywords