Probabilistic learning models for topic extraction in Thai language

Conference proceedings article


Publication Details

Author list: Asawaroengchai C., Chaisangmongkon W., Laowattana D.

Publisher: Hindawi

Publication year: 2018

Start page: 35

End page: 40

Number of pages: 6

ISBN: 9781538652541

ISSN: 0146-9428

eISSN: 1745-4557

URL: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85050154385&doi=10.1109%2fICBIR.2018.8391162&partnerID=40&md5=3d80c497c8e517ef843d8b0288f276c2

Languages: English-Great Britain (EN-GB)



Abstract

Natural language processing (NLP) in the Thai language is notoriously complicated. One major problem is the lack of word boundaries within a sentence, which introduces ambiguity in word tokenization. For topic extraction, semantic ambiguity adds another layer of complexity. Topic models that disregard word order, such as Latent Dirichlet Allocation (LDA), perform poorly on Thai. In this paper, we test a probabilistic language model that incorporates word-order information, the Topic N-grams model (TNG). We used several testing tasks to assess TNG's ability to model the generative process of Thai text and established benchmarks that compare the performance of LDA and TNG on various Thai NLP tasks. To our knowledge, this paper is the first to explore a word-order model for Thai topic extraction. We conclude that TNG can help boost the performance of Thai language processing in word cutting, semantic checking, word prediction, and document generation tasks. We also explore how the performance of LDA and TNG on such tasks can be measured using perplexity. © 2018 IEEE.
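
As a rough illustration of the perplexity measure mentioned in the abstract, the sketch below computes perplexity as the exponential of the negative mean log-probability per token. It is not the paper's LDA or TNG implementation: it swaps in an add-one-smoothed unigram model (which ignores word order) and a bigram model (which uses the previous word) as stand-ins, and the function names and toy token sequence are assumptions made for this example only.

```python
import math
from collections import Counter


def perplexity(token_probs):
    """Perplexity = exp(-mean(log p)) over the probabilities assigned to each token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(-mean_log_prob)


def unigram_probs(train, test, vocab_size):
    """Order-free stand-in: add-one-smoothed unigram probability of each test token."""
    counts = Counter(train)
    total = len(train)
    return [(counts[w] + 1) / (total + vocab_size) for w in test]


def bigram_probs(train, test, vocab_size):
    """Order-aware stand-in: add-one-smoothed P(word | previous word).

    The first test token has no predecessor, so it is skipped here.
    """
    pair_counts = Counter(zip(train, train[1:]))
    prev_counts = Counter(train[:-1])
    return [
        (pair_counts[(a, b)] + 1) / (prev_counts[a] + vocab_size)
        for a, b in zip(test, test[1:])
    ]


# Hypothetical toy data: already-tokenized placeholder symbols, not real Thai text.
train_tokens = "a b a b a b a b".split()
test_tokens = "a b a b".split()
vocab_size = len(set(train_tokens))

unigram_ppl = perplexity(unigram_probs(train_tokens, test_tokens, vocab_size))
bigram_ppl = perplexity(bigram_probs(train_tokens, test_tokens, vocab_size))
print(f"unigram perplexity: {unigram_ppl:.3f}")
print(f"bigram perplexity:  {bigram_ppl:.3f}")
```

On this strictly alternating toy sequence the order-aware model assigns higher probability to each next token, so its perplexity is lower. Lower perplexity means the model is less surprised by held-out text, which is the sense in which the measure can be used to compare models.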


Keywords

LDA, TNG, Topic Modeling, Topic N-grams, Word Cutting

