Attention-based Deep Survival Analysis for ALICE O2 Facilities

Conference proceedings article


Authors/Editors


Strategic Research Themes


Publication Details

Author list: Suthawee Weraphong, Vasco Chibante Barroso, Phond Phunchongharn, Peerapon Siripongwutikorn

Publication year (A.D.): 2025


Abstract

Predicting system failures in large-scale computing infrastructures remains a critical challenge for maintaining operational reliability. This research introduces the first attention-based deep survival analysis model for predicting out-of-memory (OOM) events in ALICE O² (Online-Offline) facilities, including First Level Processor nodes, located at LHC Point 2 at CERN. Unlike conventional survival methods that assume linear relationships and static covariates, our approach integrates bidirectional LSTM networks with multi-head attention mechanisms to capture temporal dependencies in system log data. The attention mechanism enables identification of memory stress patterns, providing actionable insights for predictive maintenance. We analyze memory performance logs from 201 First Level Processor nodes over 45 days, targeting the critical 15-60 minute windows before failure events. The proposed model achieves a concordance index of 0.792 on ALICE data and 0.668 on Google cloud cluster data, with the latter outperforming all other methods in the comparison. This work demonstrates the effectiveness of attention-based temporal modeling for system reliability prediction in distributed computing infrastructures.
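
The concordance index reported above measures how often a model ranks pairs of subjects in the right order: among comparable pairs, the subject that fails earlier should receive the higher predicted risk. The following is a minimal sketch of Harrell's concordance index in plain Python, included only to illustrate the metric. The function name `concordance_index`, its arguments, and the toy values are assumptions made for this example; they are not taken from the paper, and the authors' evaluation code may differ (for instance in how tied event times are handled).

```python
from typing import Sequence


def concordance_index(
    times: Sequence[float],
    events: Sequence[bool],
    risk_scores: Sequence[float],
) -> float:
    """Harrell's C-index for right-censored data (illustrative sketch).

    times:       observed time to the event (e.g. OOM) or to censoring
    events:      True if the event was observed, False if censored
    risk_scores: model output, higher means earlier expected failure
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        # A pair is comparable only if the earlier subject had an observed event.
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # correctly ordered pair
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs; the C-index is undefined")
    return concordant / comparable


# Toy data (hypothetical): four nodes, the third one is censored.
toy_times = [5.0, 10.0, 15.0, 20.0]
toy_events = [True, True, False, True]
toy_risks = [0.9, 0.4, 0.5, 0.1]
toy_c = concordance_index(toy_times, toy_events, toy_risks)
```

On the toy data there are five comparable pairs, four of which are ordered correctly, so `toy_c` is 0.8. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for reading the reported 0.792 and 0.668.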


Keywords

No related information found


Last updated: 2025-04-12 at 00:00