JCSE, vol. 16, no. 4, pp.222-232, 2022
DOI: http://dx.doi.org/10.5626/JCSE.2022.16.4.222
A Study of Job Failure Prediction on Supercomputers with Application Semantic Enhancement
Haotong Zhang, Gang Xian, Wenxiang Yang, and Jie Yu
College of Information Science and Engineering, Chongqing Jiaotong University, Chongqing, China
Computational Aerodynamics Institute, China Aerodynamics Research and Development Center, Mianyang, China
Computational Aerodynamics Institute, China Aerodynamics Research and Development Center, Mianyang, China
State Key Laboratory of Aerodynamics, China Aerodynamics Research and Development Center, Mianyang, China
Abstract: The powerful computing capabilities of supercomputers play an important role in today's scientific computing. A large
number of high-performance computing jobs are submitted and executed concurrently in the system. Job failures
waste system resources and reduce the efficiency of both the system and user jobs. Job failure prediction can support
fault-tolerance techniques that alleviate this problem in supercomputers. At present, related work mainly predicts
job failure by collecting real-time performance attributes of jobs, but this approach is difficult to apply in a real
environment because of the high cost of collecting job attributes. In addition to analyzing the time and resource attributes
in the job logs, this study also explores the semantic information of jobs. We mine job application semantic information
from job names and job paths, where the job path is collected by additional monitoring of the job submission process. A prediction
method based on job application semantic enhancement is proposed, and the prediction results of non-ensemble
and ensemble learning algorithms are compared under each evaluation indicator. This
prediction method requires only a small feature collection and computation overhead and is easy to apply. The experimental
results show that job application semantic enhancement improves prediction: the final evaluation indicator, the S score,
improves by 5%-6%, with 88.16% accuracy, 95.23% specificity, and
88.24% sensitivity.
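
As an illustration of the evaluation indicators named in the abstract, the following minimal sketch computes accuracy, sensitivity, and specificity from binary labels using their standard confusion-matrix definitions. It is not the authors' code. The function names, the toy labels, and the choice of which class is encoded as the positive class (1) are assumptions made for illustration only. The S score is not defined in the abstract, so it is deliberately not computed here.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for binary labels.

    `positive` is the label treated as the positive class; which job outcome
    that corresponds to is an assumption, not something the abstract states.
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def evaluate(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity, and specificity as a dict of floats."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    return {
        # Fraction of all jobs classified correctly.
        "accuracy": (tp + tn) / total if total else 0.0,
        # True positive rate: positives correctly identified.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # True negative rate: negatives correctly identified.
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical toy labels, not data from the study.
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
    for name, value in evaluate(y_true, y_pred).items():
        print(f"{name}: {value:.4f}")
```

Reporting sensitivity and specificity alongside accuracy matters when failed and successful jobs are unevenly represented, since accuracy alone can look high while one class is predicted poorly.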
Keywords:
Execution Efficiency; Job Failure Prediction; Application Semantic Enhancement; Machine Learning