Thai Tokenization with Attention Mechanism

Conference proceedings article

Authors/Editors

Santitham Prom-On (สันติธรรม พรหมอ่อน)

Strategic Research Themes

Digital Transformation (Strategic Research Themes)

Publication Details

Author list: Atiwetsakun, Jednipat; Prom-On, Santitham

Publisher: Elsevier

Publication year (CE): 2021

First page: 17

Last page: 22

Number of pages: 6

ISBN: 9781665428415

ISSN: 0928-4931

eISSN: 1873-0191

URL: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85117449044&doi=10.1109%2fIBDAP52511.2021.9552074&partnerID=40&md5=91dfea9114283128d75cf8bb59d976fb

Language: English-Great Britain (EN-GB)

Abstract

Word segmentation is an important preprocessing step in natural language processing applications, particularly for languages such as Thai that have no explicit word-boundary markers. Simple methods such as dictionary-based segmentation do not consider the context of the sentence. This paper proposes an attention-based deep learning approach for Thai word segmentation. With the help of attention, the model can learn character correlations across the entire sentence without vanishing or exploding gradient problems and tokenize the characters into word vectors. The goal of this research is to test three different types of attention mechanisms to determine their effectiveness for word tokenization. A visualization of the attention weights for each mechanism is also included as an outcome. © 2021 IEEE.
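
As a rough illustration of how attention lets every character attend to every other character in a sentence, the following minimal sketch computes scaled dot-product self-attention over per-character vectors. It is not the authors' model: the choice of scaled dot-product attention, the random untrained embeddings, the identity query/key/value projections, and the names `embed` and `self_attention` are all assumptions made for illustration. The abstract does not say which three attention mechanisms were tested.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def embed(text, dim=8, seed=0):
    """Hypothetical stand-in for a learned embedding layer:
    each distinct character gets a fixed random vector."""
    rng = random.Random(seed)
    table = {}
    vectors = []
    for ch in text:  # iterates Unicode code points, so Thai vowel and tone marks are separate positions
        if ch not in table:
            table[ch] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        vectors.append(table[ch])
    return vectors


def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections
    (an assumption; a trained model would learn Q, K and V weights)."""
    dim = len(vectors[0])
    weights = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights.append(softmax(scores))
    # Each context vector is a weighted average of all character vectors.
    context = [
        [sum(w * v[j] for w, v in zip(row, vectors)) for j in range(dim)]
        for row in weights
    ]
    return context, weights


text = "ฉันกินข้าว"  # example sentence, written without spaces between words
context, weights = self_attention(embed(text))

# Row i of `weights` shows how strongly position i attends to every position.
for ch, row in zip(text, weights):
    print(repr(ch), [round(w, 2) for w in row])
```

In a full segmenter, a trained classification layer would map each context vector to a word-boundary label; that step is omitted here. The `weights` matrix is the kind of quantity an attention visualization would plot as a heat map.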

Keywords

attention, Thai language, word segmentation