Benchmarking Cloud-based Speech-to-Text APIs for Multilingual Meetings: A Comparative Study on English, Thai, and Malay

Conference proceedings article

Authors/Editors

Vajirasak Vanijja

Publication Details

Author list: Ampuan Zuhairah Ampuan Hj Zainal, Vajirasak Vanijja, Arif Bramantoro

Publication year: 2025

Start page: 1

End page: 6

Number of pages: 6

URL: https://ieeexplore.ieee.org/document/11338283

Languages: English-United States (EN-US)

Abstract

This study evaluates three leading cloud-based speech-to-text transcription services (Google Cloud, Amazon Transcribe, and Azure Speech Services) for transcribing multilingual meeting audio in Thai, English, and Malay. The evaluation considers transcription accuracy, processing time, and cost-efficiency, using real-world meeting recordings combined with noise reduction, audio segmentation, and overlapping-segment techniques to simulate practical conditions. Performance is measured by Word Error Rate (WER) and Character Error Rate (CER), with CER applied to Thai because its script does not separate words with spaces. Results show that English achieved the highest transcription accuracy (up to 90.4%), reflecting its extensive training resources. After preprocessing, Thai accuracy improved from 55.1% to 77.1% and Malay accuracy from below 50% to about 71.4%. Azure and Amazon outperformed Google for Thai, and segmentation into 60 s chunks with 2 s overlaps reduced latency by 72%. Azure Batch Transcription was the most cost-effective option, averaging 0.18 USD per hour with 76.4% mean accuracy. These findings highlight both the strengths and the gaps of current cloud automatic speech recognition (ASR), especially for low-resource languages, and provide actionable insights for organizations seeking scalable multilingual transcription solutions.
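WER and CER are both edit distances (substitutions + deletions + insertions) normalized by the length of the reference transcript: WER counts words, CER counts characters. CER suits Thai because splitting on spaces yields few meaningful "words". The following is a minimal sketch of these standard definitions, not the authors' evaluation code. The function names `edit_distance`, `wer`, and `cer` are hypothetical, and removing whitespace before computing CER is an assumption. The abstract also does not state how its accuracy percentages are derived; reading them as 1 - error rate is likewise an assumption.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum substitutions + deletions + insertions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances from an empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # delete r from the reference
                curr[j - 1] + 1,         # insert h from the hypothesis
                prev[j - 1] + (r != h),  # substitute (cost 0 if tokens match)
            )
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str, ignore_whitespace: bool = True) -> float:
    """Character Error Rate: character-level edit distance / reference length.

    ignore_whitespace=True is an assumption (not from the study): it drops
    spaces so inconsistent spacing in unsegmented scripts such as Thai
    is not counted as errors.
    """
    if ignore_whitespace:
        reference = "".join(reference.split())
        hypothesis = "".join(hypothesis.split())
    if not reference:
        raise ValueError("reference must contain at least one character")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    # One substituted word out of five reference words.
    print(wer("the meeting starts at nine", "the meeting start at nine"))  # 0.2
    # Thai: 3 character edits against a 10-code-point reference.
    print(cer("สวัสดีครับ", "สวัสดีค่ะ"))  # 0.3
```

Characters here are Unicode code points; a study could instead count grapheme clusters, which would change Thai CER values slightly.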
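The segmentation setting reported above (60 s chunks with 2 s overlaps) means each new chunk starts 58 s after the previous one, so consecutive chunks share 2 s of audio. The sketch below only computes those window boundaries. It assumes a single recording of known duration, and the function name `segment_windows` is hypothetical. The abstract does not describe how the study cut the audio or merged the overlapping transcripts, so neither step is shown here.

```python
from typing import List, Tuple


def segment_windows(
    duration_s: float, chunk_s: float = 60.0, overlap_s: float = 2.0
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds covering [0, duration_s].

    Consecutive windows share overlap_s seconds, so a short word cut at
    one chunk boundary is likely to appear whole in the neighbouring chunk.
    """
    if chunk_s <= 0 or not 0 <= overlap_s < chunk_s:
        raise ValueError("require chunk_s > 0 and 0 <= overlap_s < chunk_s")
    step = chunk_s - overlap_s  # how far each new window advances (58 s by default)
    windows: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)  # last window may be shorter
        windows.append((start, end))
        if end >= duration_s:
            break
        start += step
    return windows


if __name__ == "__main__":
    # A 150 s recording yields windows 0-60, 58-118, and 116-150 seconds.
    for start, end in segment_windows(150.0):
        print(f"{start:6.1f} -> {end:6.1f} s")
```

The overlap length trades redundancy against boundary safety. A longer overlap protects longer words and phrases but increases the total audio sent to the API, and therefore the cost.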

Keywords

Automatic Speech Recognition, Azure Batch Transcription, Character Error Rate, cloud API benchmarking, English, low-resource languages, Malay, speech-to-text evaluation, Thai, Word Error Rate