A tailor-made approach to Thai word segmentation for topic-specific research

Conference proceedings article


Publication Details

Author list: Punjaporn Pojanapunya and Duangjaichanok Pansa

Publication year: 2021

Start page: 94

End page: 109

Number of pages: 16

URL: https://sola.pr.kmutt.ac.th/dral2021/wp-content/uploads/2022/06/7.pdf

Languages: English-United States (EN-US)


Abstract

Segmenting Thai words for use in corpus-based studies is a complex task. The two major approaches to Thai word segmentation are dictionary-based (DCB) and machine learning-based (MLB), but it is unclear which produces the more appropriate segmented text for corpus-based analysis. This paper describes a novel third approach, a two-level segmentation that segments text using specifically designed criteria. Integrating the existing approaches with these criteria, the method first segments Thai text into the shortest syllables or words and then builds longer words from 2-word, 3-word and 4-word clusters, using a reference glossary of terms as the basis for identifying clusters. All three methods were tested on a corpus of interviews about language teachers’ views on assessment. For the first two methods, word units were segmented by ready-made programs, LexTo (DCB) and TLex (MLB). The advantages and drawbacks of the three methods are discussed to help analysts who prepare Thai texts for corpus linguistics.
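
The second level of the procedure described in the abstract, joining short units into longer glossary terms, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `merge_clusters`, the `max_len` parameter, the greedy longest-first matching order, and the sample units and glossary entry are all assumptions made for illustration. The first-level split into shortest units is taken as given input.

```python
from typing import Iterable, List, Set


def merge_clusters(units: List[str], glossary: Iterable[str], max_len: int = 4) -> List[str]:
    """Join adjacent short units into longer words listed in a glossary.

    Hypothetical helper: `units` is the output of a first-level pass that
    split the text into its shortest syllables or words.
    """
    terms: Set[str] = set(glossary)
    merged: List[str] = []
    i = 0
    while i < len(units):
        # Try the longest cluster first (4, then 3, then 2 units). This
        # greedy order is an assumption; the abstract does not specify it.
        for n in range(min(max_len, len(units) - i), 1, -1):
            candidate = "".join(units[i:i + n])
            if candidate in terms:
                merged.append(candidate)
                i += n
                break
        else:
            # No glossary term starts at this position: keep the short unit.
            merged.append(units[i])
            i += 1
    return merged


# Illustrative input only; not taken from the paper's corpus or glossary.
sample_units = ["การ", "ประเมิน", "ผล", "ของ", "ครู"]
sample_glossary = {"การประเมินผล"}
print(merge_clusters(sample_units, sample_glossary))
# ['การประเมินผล', 'ของ', 'ครู']
```

Trying longer clusters before shorter ones means a 3-word glossary term is not pre-empted by a 2-word term that shares its first units; a different priority rule would be equally easy to substitute.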



Last updated on 2022-02-09 at 09:36