A hybrid approach for Thai word segmentation with crowdsourcing feedback system
Conference proceedings article
Publication Details
Author list: Chaonithi K., Prom-On S.
Publisher: Hindawi
Publication year: 2016
ISBN: 9781467397490
ISSN: 0146-9428
eISSN: 1745-4557
Languages: English-Great Britain (EN-GB)
Abstract
This paper proposes a new hybrid method for Thai word segmentation that integrates a crowd-sourced dictionary with a word bi-gram model. The main dictionary is split into a basic word dictionary and a compound word dictionary to improve the performance of the dictionary-based algorithm. Segmentation begins with a heuristic exhaustive matching algorithm that uses the basic word dictionary to generate all possible basic word sequence candidates from an input string. The word bi-gram model then selects the best candidate to resolve ambiguity. Finally, the sequence of basic words is combined into compound words using the compound word dictionary. The work also applies the crowdsourcing paradigm: we implemented a web application that trains the bi-gram model and updates the dictionaries from user feedback, which improves the lexical knowledge of the platform over time. The algorithm was evaluated on two corpora. On the InterBEST 2009 corpus, the proposed algorithm yields average precision, recall, and F-measure of 97.52%, 97.70%, and 97.63%, respectively. On a social network corpus, it yields average precision, recall, and F-measure of 98.47%, 98.59%, and 98.54%, respectively. © 2016 IEEE.
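The three stages described in the abstract (candidate generation, bi-gram selection, compound merging) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the add-one smoothing, the greedy longest-match merge, and the Latin-script toy dictionaries and counts are all assumptions made for illustration, and the plain recursion below omits whatever heuristics the paper uses to prune the exhaustive search.

```python
import math
from typing import Dict, List, Set, Tuple

def candidates(text: str, basic: Set[str]) -> List[List[str]]:
    """Enumerate every split of `text` into words from the basic dictionary."""
    if not text:
        return [[]]
    out: List[List[str]] = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in basic:
            for tail in candidates(text[end:], basic):
                out.append([head] + tail)
    return out

def bigram_score(words: List[str],
                 bigrams: Dict[Tuple[str, str], int],
                 unigrams: Dict[str, int],
                 vocab_size: int) -> float:
    """Log-probability of a word sequence under a bi-gram model.

    Add-one smoothing is an assumption; the paper does not state its smoothing.
    """
    prev, score = "<s>", 0.0
    for w in words:
        num = bigrams.get((prev, w), 0) + 1
        den = unigrams.get(prev, 0) + vocab_size
        score += math.log(num / den)
        prev = w
    return score

def merge_compounds(words: List[str], compound: Set[str]) -> List[str]:
    """Greedily join the longest run of basic words that forms a compound."""
    out: List[str] = []
    i = 0
    while i < len(words):
        for j in range(len(words), i + 1, -1):  # runs of at least two words
            joined = "".join(words[i:j])
            if joined in compound:
                out.append(joined)
                i = j
                break
        else:  # no compound starts at position i
            out.append(words[i])
            i += 1
    return out

def segment(text: str, basic: Set[str], compound: Set[str],
            bigrams: Dict[Tuple[str, str], int],
            unigrams: Dict[str, int]) -> List[str]:
    """Run the three stages; fall back to the raw string if nothing matches."""
    cands = candidates(text, basic)
    if not cands:
        return [text]
    best = max(cands, key=lambda c: bigram_score(c, bigrams, unigrams, len(basic)))
    return merge_compounds(best, compound)

# Hypothetical toy data (Latin script stands in for Thai text).
BASIC = {"i", "is", "scream", "cream", "ice"}
COMPOUND = {"icecream"}
BIGRAMS = {("<s>", "i"): 4, ("i", "scream"): 3, ("scream", "ice"): 1,
           ("ice", "cream"): 5, ("<s>", "is"): 1}
UNIGRAMS = {"<s>": 5, "i": 4, "is": 1, "scream": 3, "ice": 5, "cream": 5}

result = segment("iscreamicecream", BASIC, COMPOUND, BIGRAMS, UNIGRAMS)
# result == ["i", "scream", "icecream"]
```

In this toy run the string has two basic-word candidates ("i scream ice cream" and "is cream ice cream"); the bi-gram counts favour the first, and the compound dictionary then joins "ice" and "cream". In a crowdsourced setting like the one described, user corrections would presumably feed back into the dictionary sets and the count tables.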
Keywords
bi-gram, crowdsourcing, exhaustive matching, hybrid method, machine learning, web service, word segmentation