Krathu-500: Post-Comments Thai Corpus

Conference proceedings article


Publication Details

Author list: Pittawat Taveekitworachai, Jonathan H. Chan

Publication year: 2022

URL: https://www.ri2c-2022.com/home.aspx

Languages: English-Canada (EN-CA)


Abstract

The Krathu-500 corpus contains 574 posts from Pantip.com, each comprising the post title, the post body, and all comments on the post. The corpus contains a total of 63,293 comments written in Thai as it is used in real-life situations, covering various contexts and types, in a conversational form. The corpus is intended as a resource for improving the capabilities of machine learning techniques that deal with the Thai language. A smaller sentiment-labeled corpus of comments, with 6,306 records, is also provided. The labeled corpus is a human-annotated dataset with three labels: negative, neutral, and positive. Three baseline models were developed for the labeled corpus using simple LSTM, CNN, and BERT, respectively. The project also includes an open-source repository that allows anyone who is interested to modify and build on top of the current source code and dataset.


Index Terms—natural language processing, data collection,
Thai corpus, text classification


Keywords

Data collection, natural language processing, text classification, Thai corpus


Last updated on 2022-31-08 at 23:05