Failure detection through monitoring of the scientific distributed system

Conference proceedings article



Publication Details

Author list: Yamnual K., Phunchongharn P., Achalakul T.

Publisher: Hindawi

Publication year: 2017

Start page: 568

End page: 571

Number of pages: 4

ISBN: 9781509048977

ISSN: 0146-9428

eISSN: 1745-4557

URL: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85028549391&doi=10.1109%2fICASI.2017.7988485&partnerID=40&md5=bf3b49b47a425b23e14afbb2f7f4985b

Languages: English-Great Britain (EN-GB)



Abstract

Performance monitoring is essential for all subsystems, especially in high performance computing systems. These systems are sensitive to errors and failures, which lead to data loss and can severely impact organizations. Consequently, resource information (e.g., CPU usage, memory usage, and disk I/O usage) must be collected through system monitoring during operation so that it can be used for failure identification. However, a traditional monitoring system cannot detect failures. Since failures discovered later in the operation are more difficult and more expensive to recover from, it is highly desirable to detect them as early as possible. In this paper, we propose a proactive failure detection framework based on a monitoring system for high performance computing systems. Our proposed monitoring system is based on Elasticsearch-Logstash-Kibana (ELK), which gathers information from the scientific distributed system. We then propose a failure detection method using three machine learning techniques (i.e., Support Vector Machine, Random Forest, and Naïve Bayes). In the experimental results, the SVM classifier provides the highest accuracy, at 90%. The proposed framework, integrating ELK and the SVM classifier, can therefore provide an accurate, proactive approach to failure detection in high performance computing systems, especially scientific distributed systems. © 2017 IEEE.
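As a rough illustration of the classification step described in the abstract, the sketch below trains a Gaussian Naive Bayes classifier (one of the three techniques named) on resource metrics and labels a sample as "normal" or "failure". It is not the authors' implementation: the feature names, the synthetic data, the class centres, and the helper functions `make_samples`, `fit_gaussian_nb`, and `predict` are all assumptions made for the example. In the proposed framework, the metric vectors would presumably come from the ELK-based monitoring system rather than a random generator.

```python
import math
import random
from statistics import mean, pvariance

# Hypothetical feature order (an assumption, not taken from the paper):
# CPU usage %, memory usage %, disk I/O usage %.
FEATURES = ("cpu_pct", "mem_pct", "disk_io_pct")


def make_samples(n, centre, spread, label, rng):
    """Draw n synthetic metric vectors around `centre` (illustrative data only)."""
    return [([rng.gauss(c, spread) for c in centre], label) for _ in range(n)]


def fit_gaussian_nb(samples):
    """Estimate the prior, per-feature mean and per-feature variance of each class."""
    by_label = {}
    for x, y in samples:
        by_label.setdefault(y, []).append(x)
    model = {}
    for y, rows in by_label.items():
        cols = list(zip(*rows))
        model[y] = {
            "prior": len(rows) / len(samples),
            "mean": [mean(c) for c in cols],
            # A small floor keeps the variance strictly positive.
            "var": [pvariance(c) + 1e-9 for c in cols],
        }
    return model


def predict(model, x):
    """Return the label with the highest log-posterior for metric vector x."""
    def log_posterior(params):
        score = math.log(params["prior"])
        for xi, mu, var in zip(x, params["mean"], params["var"]):
            score += -0.5 * math.log(2 * math.pi * var) - (xi - mu) ** 2 / (2 * var)
        return score
    return max(model, key=lambda y: log_posterior(model[y]))


rng = random.Random(0)  # fixed seed so the run is repeatable
data = (make_samples(200, (35, 40, 20), 8, "normal", rng)
        + make_samples(200, (90, 85, 75), 8, "failure", rng))
rng.shuffle(data)
train, test = data[:300], data[300:]

model = fit_gaussian_nb(train)
accuracy = sum(predict(model, x) == y for x, y in test) / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The two synthetic classes are deliberately well separated, so the accuracy printed here says nothing about the 90% figure reported for SVM on real monitoring data. Comparing SVM and Random Forest in the same way would typically rely on a machine learning library rather than hand-written code.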


Keywords

Distributed Systems; Elasticsearch-Logstash-Kibana; Failure Detection; High Performance Computing Systems; Scientific Computing Systems


Last updated on 2023-29-09 at 07:35