Krathu-500: Post-Comments Thai Corpus
Conference proceedings article
Publication Details
Author list: Pittawat Taveekitworachai, Jonathan H. Chan
Publication year: 2022
URL: https://www.ri2c-2022.com/home.aspx
Languages: English-Canada (EN-CA)
Abstract
Krathu-500 contains 574 posts from Pantip.com, each with its title, body, and all of its comments. The corpus contains a total of 63,293 comments written in Thai as it is used in real-life situations, covering varied contexts and types in conversational form. It is intended as a resource for improving machine learning techniques that deal with the Thai language. A smaller sentiment-labeled corpus of 6,306 comments is also provided. The labeled corpus is human-annotated with three labels: negative, neutral, and positive. Three baseline models were developed for the labeled corpus using a simple LSTM, a CNN, and BERT, respectively. The project also includes an open-source repository so that anyone interested can modify and build on the current source code and dataset.
Index Terms: natural language processing, data collection, Thai corpus, text classification
Keywords
Data collection, natural language processing, text classification, Thai corpus