Thai Tokenization with Attention Mechanism
Conference proceedings article
Authors/Editors
Publication Details
Author list: Atiwetsakun, Jednipat; Prom-On, Santitham;
Publisher: Elsevier
Publication year: 2021
Start page: 17
End page: 22
Number of pages: 6
ISBN: 9781665428415
ISSN: 0928-4931
eISSN: 1873-0191
Languages: English-Great Britain (EN-GB)
Abstract
Word segmentation is an important preprocessing step in natural language processing applications, particularly for languages without explicit word delimiters, such as Thai. Simple methods such as dictionary-based segmentation do not take the context of the sentence into account. This paper proposes an attention-based deep learning approach to Thai word segmentation. With attention, the model can learn character correlations across the entire sentence without the vanishing- or exploding-gradient problems and tokenize the characters into word vectors. The goal of this research is to evaluate three different attention mechanisms and compare their effectiveness for word tokenization. Visualizations of the attention weights for each mechanism are also presented. © 2021 IEEE.
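The abstract's core idea — using attention to capture pairwise character correlations across a whole sentence — can be sketched minimally. The snippet below is illustrative only and is not the paper's model: the embedding size, the random embeddings, and the toy Thai string are all assumptions, and a trained model would add learned projections plus a per-character boundary classifier (e.g. begin/inside tags) on top of the attended representations.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return attended values and attention weights.

    Q, K, V: (seq_len, d) arrays of per-character representations.
    Each row of the weight matrix is a distribution over all
    characters in the sentence, so every character can attend to
    the full context in a single step (no recurrence, hence no
    vanishing/exploding gradients along the sequence).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # pairwise character correlations
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy sentence: 6 Thai characters (hypothetical input, not from the paper).
chars = list("แมวกิน")
rng = np.random.default_rng(0)
E = rng.normal(size=(len(chars), 8))  # stand-in character embeddings

context, attn = scaled_dot_product_attention(E, E, E)
# context: one contextualized vector per character; a segmentation head
# would classify each one as a word boundary or continuation.
```

In the self-attention case sketched here, queries, keys, and values all come from the same character sequence; the paper's three attention variants would differ in how these correlations are computed and combined.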
Keywords
attention, Thai language, word segmentation