Probabilistic learning models for topic extraction in Thai language

Conference proceedings article

Authors/Editors

วราสิณี ฉายแสงมงคล

Strategic Research Themes

No related information found

Publication Details

Author list: Asawaroengchai C., Chaisangmongkon W., Laowattana D.

Publisher: Hindawi

Publication year (CE): 2018

Start page: 35

End page: 40

Number of pages: 6

ISBN: 9781538652541

ISSN: 0146-9428

eISSN: 1745-4557

URL: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85050154385&doi=10.1109%2fICBIR.2018.8391162&partnerID=40&md5=3d80c497c8e517ef843d8b0288f276c2

Language: English-Great Britain (EN-GB)

Abstract

Natural language processing (NLP) in the Thai language is notoriously complicated. One major problem is the lack of word boundaries in a sentence, which introduces ambiguity in word tokenization. For topic extraction, semantic ambiguity adds another layer of complexity. Topic models that disregard word order, such as Latent Dirichlet Allocation (LDA), perform poorly on Thai text. In this paper, we tested a probabilistic language model equipped with word location information, the so-called Topic N-grams model (TNG). We deployed several testing tasks to assess TNG's ability to model the generative process of Thai text and established benchmarks that compare the performance of LDA and TNG on various NLP tasks in the Thai language. To our knowledge, this paper is the first to explore a word-order model for Thai-language topic extraction. We concluded that TNG can help boost the performance of Thai language processing in word cutting, semantic checking, word prediction, and document generation tasks. We also explored how to measure the performance of LDA and TNG on such tasks using perplexity. © 2018 IEEE.
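
The abstract names perplexity as the evaluation measure but does not define it. As a minimal sketch, assuming the standard definition (the exponential of the negative mean log-probability a model assigns to each token), it could be computed as below. The function name `perplexity` and the probability values are hypothetical illustrations, not taken from the paper.

```python
import math


def perplexity(token_probs):
    """Return exp(-mean(log p)) over per-token probabilities.

    token_probs: probabilities (each in (0, 1]) that some language model,
    e.g. an LDA- or TNG-style model, assigns to each token of held-out text.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    mean_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(-mean_log_prob)


# Hypothetical values: a model that is uniformly unsure among 4 choices
# per token has a perplexity of approximately 4.
uniform_model = [0.25, 0.25, 0.25, 0.25]

# Hypothetical values: a model that uses context assigns higher
# probability to the observed tokens, so its perplexity is lower.
context_model = [0.5, 0.4, 0.6, 0.5]

print(perplexity(uniform_model))
print(perplexity(context_model))
```

Lower perplexity means the model found the held-out text less surprising, which is the sense in which two models can be compared on the same text.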

Keywords

LDA, TNG, Topic Modeling, Topic N-grams, Word Cutting