Probabilistic learning models for topic extraction i Thai language
Conference proceedings article
Authors/Editors
Strategic Research Themes
No matching items found.
Publication Details
Author list: Asawaroengchai C., Chaisangmongkon W., Laowattana D.
Publisher: Hindawi
Publication year: 2018
Start page: 35
End page: 40
Number of pages: 6
ISBN: 9781538652541
ISSN: 0146-9428
eISSN: 1745-4557
Languages: English-Great Britain (EN-GB)
Abstract
Natural language processing (NLP) in Thai language is notoriously complicated. One major problem is the lack of word boundary in a sentence, introducing ambiguity in word tokenization. For topic extraction, semantic ambiguity adds another layer of complexity to the problem. Topic model that disregards word order, such as Latent Dirichlet Allocation (LDA), performs poorly in Thai Language. In this paper, we experimented and tested a probabilistic language model equipped with word location information, the so-called Topic N-grams model (TNG). We deployed several testing tasks to assess TNG's capabilities of modeling the generative process of Thai text and established benchmarks that compare the performance of LDA and TNG for various NLP tasks in Thai language. To our knowledge, this paper is the first to explore word-order model in Thai language topic extraction. We concluded that TNG can help boosting performance of Thai language processing in word cutting, semantic checking, word prediction, and document generation task. We also explored how we can measure performance of LDA and TNG on such tasks using perplexity. ฉ 2018 IEEE.
Keywords
LDA, TNG, Topic Modeling, Topic N-grams, Word Cutting