Failure detection through monitoring of the scientific distributed system
Conference proceedings article
Publication Details
Author list: Yamnual K., Phunchongharn P., Achalakul T.
Publisher: Hindawi
Publication year: 2017
Start page: 568
End page: 571
Number of pages: 4
ISBN: 9781509048977
ISSN: 0146-9428
eISSN: 1745-4557
Languages: English-Great Britain (EN-GB)
Abstract
Performance monitoring is essential for all subsystems, especially in high performance computing systems. These systems are sensitive to errors and failures, which lead to data loss and can severely affect organizations. Consequently, resource information (e.g., CPU usage, memory usage, and disk I/O usage) must be collected during operation through system monitoring so that it can be used for failure identification. However, a traditional monitoring system cannot detect failures. Since failures discovered later in the operation are more difficult and more expensive to recover from, it is highly desirable to detect them as early as possible. In this paper, we propose a proactive failure detection framework based on a monitoring system for high performance computing systems. Our proposed monitoring system is based on Elasticsearch-Logstash-Kibana (ELK), which gathers information from the scientific distributed system. We then propose a failure detection method using three machine learning techniques (i.e., Support Vector Machine, Random Forest, and Naïve Bayes). In the experimental results, the SVM classifier provides the highest accuracy, at 90%. The proposed framework, which integrates ELK and the SVM classifier, can therefore provide an accurate, proactive approach to failure detection in high performance computing systems, especially scientific distributed systems. © 2017 IEEE.
Keywords
Distributed Systems, Elasticsearch-Logstash-Kibana, Failure Detection, High Performance Computing Systems, Scientific Computing Systems