Attention-based Deep Survival Analysis for ALICE O2 Facilities
Conference proceedings article
Publication Details
Author list: Suthawee Weraphong, Vasco Chibante Barroso, Phond Phunchongharn, Peerapon Siripongwutikorn
Publication year (A.D.): 2025
Abstract
Predicting system failures in large-scale computing infrastructures remains a critical challenge for maintaining operational reliability. This research introduces the first attention-based deep survival analysis model for predicting out-of-memory (OOM) events in ALICE O² (Online-Offline) facilities, including First Level Processor nodes, located at LHC Point 2 at CERN. Unlike conventional survival methods that assume linear relationships and static covariates, our approach integrates bidirectional LSTM networks with multi-head attention mechanisms to capture temporal dependencies in system log data. The attention mechanism enables the identification of memory stress patterns, providing actionable insights for predictive maintenance. We analyze memory performance logs from 201 First Level Processor nodes over 45 days, targeting the critical windows of 15 to 60 minutes before failure events. The proposed model achieves a concordance index of 0.792 on ALICE data and 0.668 on Google cloud cluster data, with the latter outperforming all other methods in the comparison. This work demonstrates the effectiveness of attention-based temporal modeling for system reliability prediction in distributed computing infrastructures.
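For readers unfamiliar with the concordance index reported above, the following is a minimal illustrative sketch of Harrell's C-index in plain Python. It is not the authors' evaluation code: the function name `concordance_index`, the toy data, and the simplification of skipping pairs with tied observed times are all assumptions made for illustration. The index is the fraction of comparable node pairs in which the node that failed earlier was assigned the higher predicted risk; 0.5 corresponds to random ranking and 1.0 to perfect ranking.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C-index (illustrative sketch, not the paper's code).

    times  : observed time to OOM event or to censoring
    events : 1 if the OOM event was observed, 0 if censored
    risks  : predicted risk scores (higher means earlier expected failure)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let `a` be the sample with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Simplification (assumption): skip tied times. A pair is also
        # not comparable when the earlier sample is censored.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0   # earlier failure ranked as riskier
        elif risks[a] == risks[b]:
            concordant += 0.5   # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Hypothetical toy data: four nodes, the third one censored.
    times = [10, 20, 30, 40]
    events = [1, 1, 0, 1]
    risks = [0.9, 0.7, 0.8, 0.1]
    print(concordance_index(times, events, risks))
```

Under this definition, a value such as 0.792 would mean that roughly 79% of comparable pairs are ranked in the correct order by the model's risk scores.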
Keywords
No matching items found






