Probabilistic learning models for topic extraction in Thai language

Conference proceedings article

Authors/Editors

WARASINEE CHAISANGMONGKON

Publication Details

Author list: Asawaroengchai C., Chaisangmongkon W., Laowattana D.

Publisher: Hindawi

Publication year: 2018

Start page: 35

End page: 40

Number of pages: 6

ISBN: 9781538652541

ISSN: 0146-9428

eISSN: 1745-4557

URL: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85050154385&doi=10.1109%2fICBIR.2018.8391162&partnerID=40&md5=3d80c497c8e517ef843d8b0288f276c2

Languages: English-Great Britain (EN-GB)

Abstract

Natural language processing (NLP) in the Thai language is notoriously complicated. One major problem is the lack of word boundaries within a sentence, which introduces ambiguity in word tokenization. For topic extraction, semantic ambiguity adds another layer of complexity. Topic models that disregard word order, such as Latent Dirichlet Allocation (LDA), perform poorly on Thai. In this paper, we experimented with and tested a probabilistic language model equipped with word location information, the so-called Topic N-grams model (TNG). We deployed several testing tasks to assess TNG's capability to model the generative process of Thai text and established benchmarks that compare the performance of LDA and TNG on various NLP tasks in Thai. To our knowledge, this paper is the first to explore a word-order model for Thai language topic extraction. We concluded that TNG can help boost the performance of Thai language processing in word cutting, semantic checking, word prediction, and document generation tasks. We also explored how the performance of LDA and TNG on such tasks can be measured using perplexity. © 2018 IEEE.

Keywords

LDA, TNG, Topic Modeling, Topic N-grams, Word Cutting