A tailor-made approach to Thai word segmentation for topic-specific research

Conference proceedings article

Strategic Research Themes

Big Data Analytics (Digital Transformation)

Publication Details

Author list: Punjaporn Pojanapunya and Duangjaichanok Pansa

Publication year: 2021

Start page: 94

End page: 109

Number of pages: 16

URL: https://sola.pr.kmutt.ac.th/dral2021/wp-content/uploads/2022/06/7.pdf

Languages: English-United States (EN-US)

Abstract

Segmenting Thai words for use in corpus-based studies is a complex task. The two major approaches to Thai word segmentation are dictionary-based (DCB) and machine learning-based (MLB). However, it is unclear which approach produces the most appropriate segmented text for corpus-based analysis. This paper describes a novel third approach, a two-level segmentation that segments text according to specifically designed criteria. By integrating existing approaches with these criteria, the method first segments Thai text into the shortest syllables or words and then builds longer words from 2-word, 3-word and 4-word clusters, using a reference glossary of terms as the basis for identifying clusters. All three methods were tested on a corpus of interviews about language teachers’ views on assessment. For the first two methods, word units were segmented by ready-made programs, LexTo (DCB) and TLex (MLB). The advantages and drawbacks of the three methods are discussed, with the aim of assisting analysts who prepare Thai texts for corpus linguistics.
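The second level described in the abstract, building longer words from 2-, 3- and 4-unit clusters that appear in a reference glossary, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name merge_clusters, the greedy longest-match strategy, and the sample units and glossary entries are all assumptions made for illustration, and the first-level segmentation is taken as already done.

```python
def merge_clusters(units, glossary, max_len=4):
    """Merge adjacent short units into longer words found in a glossary.

    Assumption: greedy longest-match, trying 4-unit clusters first,
    then 3-unit, then 2-unit. The abstract does not specify the order.
    """
    merged = []
    i = 0
    while i < len(units):
        # Try the longest cluster that still fits, down to 2 units.
        for n in range(min(max_len, len(units) - i), 1, -1):
            candidate = "".join(units[i:i + n])
            if candidate in glossary:
                merged.append(candidate)
                i += n
                break
        else:
            # No glossary match starts here: keep the short unit as is.
            merged.append(units[i])
            i += 1
    return merged


# Hypothetical first-level output (shortest units) and glossary.
units = ["ครู", "สอน", "ภาษา", "อังกฤษ", "และ", "การ", "ประเมิน", "ผล"]
glossary = {"ภาษาอังกฤษ", "การประเมินผล"}

result = merge_clusters(units, glossary)
print(result)
```

Under these assumptions, the 2-unit cluster ภาษา + อังกฤษ and the 3-unit cluster การ + ประเมิน + ผล are joined because they match glossary terms, while units with no glossary match are left unchanged. A topic-specific glossary is what tailors the output to a given research domain.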
