Thai Tokenization with Attention Mechanism
Conference proceedings article
Authors/Editors
Publication Details
Author list: Atiwetsakun, Jednipat; Prom-On, Santitham;
Publisher: Elsevier
Publication year: 2021
Start page: 17
End page: 22
Number of pages: 6
ISBN: 9781665428415
ISSN: 0928-4931
eISSN: 1873-0191
Languages: English-Great Britain (EN-GB)
Abstract
Word segmentation is an important preprocessing step in natural language processing applications, particularly for languages without explicit word delimiters, such as Thai. Simple methods such as dictionary-based segmentation do not take the context of the sentence into account. This paper proposes an attention-based deep learning approach to Thai word segmentation. With attention, the model can learn character correlations across the entire sentence without the vanishing- or exploding-gradient problems and tokenize the characters into word vectors. The goal of this research is to evaluate three different attention mechanisms and compare their effectiveness for word tokenization. Visualizations of the attention weights for each mechanism are also presented. © 2021 IEEE.
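The abstract's core idea — using attention to capture pairwise character correlations across a whole sentence — can be sketched minimally. The snippet below is illustrative only and is not the paper's model: the embedding size, the random embeddings, and the toy Thai string are all assumptions, and a trained model would add learned projections plus a per-character boundary classifier (e.g. begin/inside tags) on top of the attended representations.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return attended values and attention weights.

    Q, K, V: (seq_len, d) arrays of per-character representations.
    Each row of the weight matrix is a distribution over all
    characters in the sentence, so every character can attend to
    the full context in a single step (no recurrence, hence no
    vanishing/exploding gradients along the sequence).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # pairwise character correlations
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy sentence: 6 Thai characters (hypothetical input, not from the paper).
chars = list("แมวกิน")
rng = np.random.default_rng(0)
E = rng.normal(size=(len(chars), 8))  # stand-in character embeddings

context, attn = scaled_dot_product_attention(E, E, E)
# context: one contextualized vector per character; a segmentation head
# would classify each one as a word boundary or continuation.
```

In the self-attention case sketched here, queries, keys, and values all come from the same character sequence; the paper's three attention variants would differ in how these correlations are computed and combined.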
Keywords
attention, Thai language, word segmentation