A hybrid approach for Thai word segmentation with crowdsourcing feedback system

Conference proceedings article



Publication Details

Author list: Chaonithi K., Prom-On S.

Publisher: Hindawi

Publication year: 2016

ISBN: 9781467397490

ISSN: 0146-9428

eISSN: 1745-4557

URL: https://www.scopus.com/inward/record.uri?eid=2-s2.0-84988893977&doi=10.1109%2fECTICon.2016.7561298&partnerID=40&md5=48c35f448cb25b74b992828e81bcb181

Languages: English-Great Britain (EN-GB)




Abstract

This paper proposes a new hybrid method for Thai word segmentation that integrates a crowd-sourced dictionary with a word bi-gram model. The main dictionary is split into a basic word dictionary and a compound word dictionary to improve the performance of the dictionary-based algorithm. Segmentation begins with a heuristic exhaustive matching algorithm that uses the basic word dictionary to generate all possible basic word sequence candidates from an input string. The word bi-gram model then selects the best candidate, resolving ambiguity. Finally, the sequence of basic words is combined into compound words using the compound word dictionary. The second part of this work applies the crowdsourcing paradigm: we implemented a web application that trains the bi-gram model and updates the dictionary from user feedback, so that the lexical knowledge of the platform improves over time. The algorithm was evaluated on two corpora. On the InterBEST 2009 corpus, the proposed algorithm yields average precision, recall and f-measure of 97.52%, 97.70%, and 97.63%, respectively. On a social network corpus, it yields average precision, recall and f-measure of 98.47%, 98.59%, and 98.54%, respectively. © 2016 IEEE.
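
The three stages described in the abstract (exhaustive matching, bi-gram selection, compound merging) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the add-one smoothing, the `<s>` start token, the `max_len` and `max_parts` limits, and the greedy longest-first compound merge are all assumptions made for illustration, since the abstract does not specify the heuristics used.

```python
import math

def candidates(text, basic_words, max_len=20):
    """Enumerate every split of `text` into basic dictionary words.

    Plain exhaustive enumeration (exponential in the worst case);
    `max_len` is an assumed cap on word length, not from the paper.
    """
    results = []

    def walk(start, path):
        if start == len(text):
            results.append(list(path))
            return
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            word = text[start:end]
            if word in basic_words:
                path.append(word)
                walk(end, path)
                path.pop()

    walk(0, [])
    return results

def bigram_score(words, bigram_counts, unigram_counts, vocab_size):
    """Log-probability of a word sequence under a bi-gram model.

    Add-one smoothing is an assumption chosen to keep the sketch short.
    """
    score = 0.0
    prev = "<s>"  # hypothetical sentence-start token
    for w in words:
        num = bigram_counts.get((prev, w), 0) + 1
        den = unigram_counts.get(prev, 0) + vocab_size
        score += math.log(num / den)
        prev = w
    return score

def merge_compounds(words, compound_words, max_parts=4):
    """Greedily join adjacent basic words that form a known compound."""
    out = []
    i = 0
    while i < len(words):
        # Try the longest run of adjacent words first.
        for n in range(min(max_parts, len(words) - i), 1, -1):
            joined = "".join(words[i:i + n])
            if joined in compound_words:
                out.append(joined)
                i += n
                break
        else:
            # No compound starts here; keep the basic word as-is.
            out.append(words[i])
            i += 1
    return out

def segment(text, basic_words, compound_words, bigram_counts, unigram_counts):
    """Run the three stages; fall back to the raw text if nothing matches."""
    cands = candidates(text, basic_words)
    if not cands:
        return [text]
    vocab_size = len(basic_words)
    best = max(
        cands,
        key=lambda c: bigram_score(c, bigram_counts, unigram_counts, vocab_size),
    )
    return merge_compounds(best, compound_words)

# Toy data (invented counts): the string can be read as "ตา|กลม" or "ตาก|ลม".
basic = {"ตา", "กลม", "ตาก", "ลม"}
bigrams = {("ตา", "กลม"): 5}
unigrams = {"<s>": 10, "ตา": 6, "ตาก": 2}
print(segment("ตากลม", basic, set(), bigrams, unigrams))
```

With these invented counts the bi-gram model prefers the split "ตา|กลม" over "ตาก|ลม". In the system the abstract describes, the counts and dictionaries would instead be updated from user feedback collected through the web application.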


Keywords

bi-gram, crowdsourcing, exhaustive matching, Hybrid method, Machine Learning, web service, word segmentation

