Thai Tokenization with Attention Mechanism

Conference proceedings article



Publication Details

Author list: Atiwetsakun, Jednipat; Prom-On, Santitham

Publisher: IEEE

Publication year: 2021

Start page: 17

End page: 22

Number of pages: 6

ISBN: 9781665428415

ISSN: 0928-4931

eISSN: 1873-0191

URL: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85117449044&doi=10.1109%2fIBDAP52511.2021.9552074&partnerID=40&md5=91dfea9114283128d75cf8bb59d976fb

Languages: English-Great Britain (EN-GB)




Abstract

Word segmentation is an important preprocessing step in natural language processing, particularly for languages such as Thai that lack explicit word-boundary markers. Simple methods such as dictionary-based segmentation do not consider the context of the sentence. This paper proposes an attention-based deep learning approach to Thai word segmentation. With attention, the model can learn character correlations across the entire sentence without vanishing or exploding gradients and tokenize the characters into word vectors. The goal of this research is to evaluate three types of attention mechanisms and determine their effectiveness for word tokenization. Visualizations of each attention mechanism are also presented as an outcome. © 2021 IEEE.
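
To make the approach concrete, here is a minimal, hypothetical PyTorch sketch of character-level self-attention for Thai word segmentation, framed as per-character word-boundary tagging. The class name, dimensions, and the use of scaled dot-product multi-head self-attention are illustrative assumptions only; the paper compares three attention mechanisms, and this sketch shows just one plausible variant, not the authors' implementation.

import torch
import torch.nn as nn

class AttentionSegmenter(nn.Module):
    # Hypothetical model, not the authors' implementation.
    def __init__(self, vocab_size, d_model=128, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Self-attention lets every character attend to the whole
        # sentence directly, sidestepping the vanishing/exploding
        # gradients of long recurrent chains.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.classify = nn.Linear(d_model, 2)  # word-start vs. inside-word

    def forward(self, char_ids):
        x = self.embed(char_ids)                # (batch, seq, d_model)
        attended, weights = self.attn(x, x, x)  # 'weights' can be plotted
        x = self.norm(x + attended)             # residual + layer norm
        return self.classify(x)                 # per-character logits

# Toy usage: a batch of 2 "sentences" of 10 character ids each.
model = AttentionSegmenter(vocab_size=100)
logits = model(torch.randint(0, 100, (2, 10)))
boundaries = logits.argmax(dim=-1)  # predicted word-start positions

The attention weights returned by nn.MultiheadAttention are the kind of quantity the paper's attention visualizations would correspond to: one row per character, showing which other characters it attends to when a boundary is decided.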


Keywords

attention; Thai language; word segmentation

