A tailor-made approach to Thai word segmentation for topic-specific research
Conference proceedings article
Publication Details
Author list: Punjaporn Pojanapunya and Duangjaichanok Pansa
Publication year: 2021
Start page: 94
End page: 109
Number of pages: 16
URL: https://sola.pr.kmutt.ac.th/dral2021/wp-content/uploads/2022/06/7.pdf
Languages: English-United States (EN-US)
Abstract
Segmenting Thai text into words for corpus-based studies is a complex task. The two major approaches to Thai word segmentation are dictionary-based (DCB) and machine learning-based (MLB). However, it is unclear which of them produces the most appropriate segmented text for corpus-based analysis. This paper describes a novel third approach: a two-level segmentation that applies specifically designed criteria on top of existing approaches. The method first segments Thai text into the shortest syllables or words, and then builds longer words from 2-word, 3-word and 4-word clusters, using a reference glossary of terms as the basis for identifying clusters. All three methods were tested on a corpus of interviews about language teachers’ views on assessment. For the first two methods, the text was segmented by ready-made programs, LexTo (DCB) and TLex (MLB). The advantages and drawbacks of the three methods are discussed, with the aim of helping analysts who prepare Thai texts for corpus linguistics research.
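The second level of the approach described in the abstract (joining short units into longer words with the help of a glossary) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `merge_clusters`, the longest-match-first order, and the sample tokens and glossary entries are all assumptions made for illustration, and the first-level segmentation is taken as given.

```python
from typing import Iterable, List, Set, Tuple


def merge_clusters(tokens: List[str],
                   glossary: Iterable[Tuple[str, ...]],
                   max_n: int = 4) -> List[str]:
    """Join adjacent short units that together form a glossary term.

    Assumption: longer clusters are tried before shorter ones
    (4, then 3, then 2 units); the abstract does not state the order.
    """
    terms: Set[Tuple[str, ...]] = {tuple(t) for t in glossary}
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest cluster that still fits in the remaining text.
        for n in range(min(max_n, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in terms:
                # Thai is written without spaces, so units are concatenated.
                merged.append("".join(candidate))
                i += n
                break
        else:
            # No glossary term starts here; keep the short unit unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical first-level output (shortest units) and glossary entries.
tokens = ["การ", "ประเมิน", "ผล", "ใน", "ชั้น", "เรียน"]
glossary = [("การ", "ประเมิน", "ผล"), ("ชั้น", "เรียน")]
result = merge_clusters(tokens, glossary)
```

With these sample inputs, the three units of "การประเมินผล" and the two units of "ชั้นเรียน" are each joined into one word, while "ใน" stays as a single unit. A topic-specific glossary is what makes the result tailor-made: changing the glossary changes which clusters are treated as words.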