Failure detection through monitoring of the scientific distributed system

Conference proceedings article

Authors/Editors

TIRANEE ACHALAKUL

Publication Details

Author list: Yamnual K., Phunchongharn P., Achalakul T.

Publisher: Hindawi

Publication year: 2017

Start page: 568

End page: 571

Number of pages: 4

ISBN: 9781509048977

ISSN: 0146-9428

eISSN: 1745-4557

URL: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85028549391&doi=10.1109%2fICASI.2017.7988485&partnerID=40&md5=bf3b49b47a425b23e14afbb2f7f4985b

Languages: English-Great Britain (EN-GB)

Abstract

Performance monitoring is essential for all subsystems, especially in high performance computing systems. These systems are sensitive to errors and failures, which lead to data loss and can severely impact organizations. Consequently, resource information (e.g., CPU usage, memory usage, and disk I/O usage) must be collected during operation through system monitoring so that it can be used for failure identification. However, a traditional monitoring system cannot detect failures. Since failures discovered later in operation are more difficult and more expensive to recover from, it is highly desirable to detect them as early as possible. In this paper, we propose a proactive failure detection framework based on a monitoring system for high performance computing systems. Our proposed monitoring system is based on Elasticsearch-Logstash-Kibana (ELK), which gathers information from the scientific distributed system. We then propose a failure detection method using three machine learning techniques (i.e., Support Vector Machine, Random Forest, and Naïve Bayes). In the experimental results, the SVM classifier provides the highest accuracy, at 90%. The proposed framework, which integrates ELK and the SVM classifier, can therefore provide an accurate, proactive approach to failure detection in high performance computing systems, especially scientific distributed systems. © 2017 IEEE.
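
As a rough illustration of the classification step described in the abstract, the following minimal sketch trains a Gaussian Naïve Bayes classifier (one of the three techniques named) on resource readings and labels each reading as normal or failing. It is not the authors' implementation: the data are synthetic, the three features (CPU, memory, and disk I/O usage in percent) and their distributions are assumptions, and the function names are hypothetical. The accuracy it prints says nothing about the results reported in the paper.

```python
import math
import random


def fit_gaussian_nb(samples, labels):
    """Estimate a prior plus per-feature mean and variance for each class."""
    model = {}
    for cls in set(labels):
        rows = [s for s, y in zip(samples, labels) if y == cls]
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps the variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[cls] = (n / len(samples), means, variances)
    return model


def predict(model, sample):
    """Return the class with the highest log-posterior for one reading."""
    best_cls, best_score = None, -math.inf
    for cls, (prior, means, variances) in model.items():
        score = math.log(prior)
        for x, m, var in zip(sample, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls


def synthetic_reading(rng, failing):
    """Hypothetical reading: [cpu %, memory %, disk I/O %] (assumed values)."""
    center = (90.0, 85.0, 80.0) if failing else (40.0, 50.0, 30.0)
    return [rng.gauss(c, 8.0) for c in center]


rng = random.Random(0)
labels = [i % 2 for i in range(400)]  # 0 = normal, 1 = failing
samples = [synthetic_reading(rng, bool(y)) for y in labels]

# Hold out the last 100 readings for evaluation.
train_x, test_x = samples[:300], samples[300:]
train_y, test_y = labels[:300], labels[300:]

model = fit_gaussian_nb(train_x, train_y)
accuracy = sum(predict(model, x) == y for x, y in zip(test_x, test_y)) / len(test_y)
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

In a deployment along the lines the abstract describes, the readings would presumably come from the ELK-based monitoring system rather than a random generator, and the same train-and-evaluate loop could be repeated for each candidate classifier to compare accuracy.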

Keywords

Distributed Systems, Elasticsearch-Logstash-Kibana, Failure Detection, High Performance Computing Systems, Scientific Computing Systems