Attention-based Deep Survival Analysis for ALICE O2 Facilities
Conference proceedings article
Publication Details
Author list: Suthawee Weraphong, Vasco Chibante Barroso, Phond Phunchongharn, Peerapon Siripongwutikorn
Publication year: 2025
Abstract
Predicting system failures in large-scale computing infrastructures remains a critical challenge for maintaining operational reliability. This research introduces the first attention-based deep survival analysis model for predicting out-of-memory (OOM) events in ALICE O² (Online-Offline) facilities, including First Level Processor nodes, located at LHC Point 2 at CERN. Unlike conventional survival methods that assume linear relationships and static covariates, our approach integrates bidirectional LSTM networks with multi-head attention mechanisms to capture temporal dependencies in system log data. The attention mechanism enables identification of memory stress patterns, providing actionable insights for predictive maintenance. We analyze memory performance logs from 201 First Level Processor nodes over 45 days, targeting the critical 15-60 minute windows before failure events. The proposed model achieves a concordance index of 0.792 for ALICE data and 0.668 for Google cloud cluster data, with the latter outperforming all the other methods in the comparison. This work demonstrates the effectiveness of attention-based temporal modeling for system reliability prediction in distributed computing infrastructures.
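For readers unfamiliar with the concordance index reported above, the following is a minimal sketch of Harrell's C-index for right-censored data. It is not the authors' evaluation code: the function name, the argument names, and the toy values are assumptions made for illustration. A pair of nodes is comparable when the node with the shorter observed time actually experienced the event, and the pair is concordant when that node also received the higher predicted risk.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index (illustrative sketch, hypothetical helper).

    times  : observed time to event or censoring for each subject
    events : 1 if the event (e.g. an OOM failure) was observed, 0 if censored
    risks  : predicted risk scores; higher means earlier expected failure
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        # Only subjects with an observed event can anchor a comparable pair.
        if not events[i]:
            continue
        for j in range(n):
            # Subject i failed strictly before subject j was last observed.
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    # Tied risk scores count as half concordant.
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data, invented for illustration only (not from the paper).
toy_times = [10, 20, 30, 40]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.4, 0.5, 0.1]
toy_c = concordance_index(toy_times, toy_events, toy_risks)
print(toy_c)  # 4 concordant out of 5 comparable pairs -> 0.8
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the reported 0.792 and 0.668. This sketch is quadratic in the number of subjects and ignores ties in observed time; a production evaluation would normally rely on an established survival analysis library.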






