A tailor-made approach to Thai word segmentation for topic-specific research

Conference proceedings article

Authors/Editors

Strategic Research Themes

Big Data Analytics (Digital Transformation)

Publication Details

Author list: Punjaporn Pojanapunya and Duangjaichanok Pansa

Publication year (A.D.): 2021

Start page: 94

End page: 109

Number of pages: 16

URL: https://sola.pr.kmutt.ac.th/dral2021/wp-content/uploads/2022/06/7.pdf

Language: English-United States (EN-US)

Abstract

Segmenting Thai words for use in corpus-based studies is a complex task. The two major approaches to Thai word segmentation are dictionary-based (DCB) and machine learning-based (MLB). However, it is unclear which of them produces the more appropriate segmented text for use in a corpus-based analysis. This paper describes a novel third approach, a two-level segmentation which segments text by using specifically designed criteria. By integrating existing approaches with specific criteria, this method first segments Thai text into the shortest syllables or words and then creates longer words from 2-word, 3-word and 4-word clusters, using a reference glossary of terms as the basis for identifying clusters. For this study, all three methods were tested on a corpus of interviews on language teachers’ views on assessment. For the first two methods, word units were segmented by ready-made programs, LexTo (DCB) and TLex (MLB). The advantages and drawbacks of the three methods are discussed to help analysts who prepare Thai texts for corpus linguistics.
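
The second level of the approach described in the abstract, joining short units into longer words by looking up 2-, 3- and 4-unit clusters in a reference glossary, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function name `merge_clusters`, the longest-match-first order, and the sample tokens and glossary entry are all assumptions made for illustration, and the first-level segmentation into shortest units is assumed to have been done already.

```python
from typing import List, Set


def merge_clusters(units: List[str], glossary: Set[str], max_n: int = 4) -> List[str]:
    """Join adjacent short units into longer words found in a glossary.

    Hypothetical sketch: at each position, try the longest cluster first
    (max_n units, down to 2). Units are joined without spaces because
    Thai does not separate words with spaces.
    """
    merged: List[str] = []
    i = 0
    while i < len(units):
        # Try 4-unit, then 3-unit, then 2-unit clusters starting at i.
        for n in range(min(max_n, len(units) - i), 1, -1):
            candidate = "".join(units[i:i + n])
            if candidate in glossary:
                merged.append(candidate)
                i += n
                break
        else:
            # No glossary cluster starts here; keep the short unit as is.
            merged.append(units[i])
            i += 1
    return merged


# Illustrative input: short units from an assumed first-level segmentation.
sample_units = ["การ", "ประเมิน", "ผล", "ของ", "ครู"]
sample_glossary = {"การประเมินผล"}  # assumed glossary entry ("assessment")
sample_result = merge_clusters(sample_units, sample_glossary)
```

With these sample inputs the three units "การ", "ประเมิน" and "ผล" are merged into the single glossary term, while "ของ" and "ครู" stay as separate units. Trying longer clusters first is one possible design choice; the abstract does not state the order in which the clusters are matched.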

Keywords

No related information found