An Error Detection and Tolerance Framework for Task Parallel Applications on High Performance Computing Systems

Yanfei Fang, Enming Dong, Yanbing Li, Qi Liu, Fengbin Qi, Peibing Du

2022

Abstract

As high performance computing enters the exascale era, systems now scale to more than 10 million cores. At this scale the mean time between failures is short, which poses a serious challenge to system reliability. Single-processor failure is a common type of system failure. If the system can detect the failure, system-level fault tolerance can be applied through techniques such as checkpoint rollback. However, some single-processor faults cannot be detected by the system; they manifest only as incorrect computation results. To address this problem, an error detection and fault tolerance framework for task parallel applications is proposed. The framework provides three functions: dynamic task scheduling, error detection, and fault tolerance. While a task parallel application is running, error detection is actively initiated. When a node failure is detected, the failed node is discarded and the tasks assigned to it since the last checkpoint are reassigned to other healthy nodes. The experimental results show that the framework can effectively detect node failures, and that fault tolerance can be performed without interrupting the running application, avoiding the time cost of checkpoint rollback.
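The recovery behaviour described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the abstract does not say how errors are detected, so the known-answer `probe` used here is an assumption, and every name (`run_with_error_detection`, `compute`, `probe`, the node labels) is hypothetical. Each scheduling round stands in for one checkpoint interval.

```python
from collections import deque


def run_with_error_detection(tasks, nodes, compute, probe):
    """Run tasks on nodes, discarding nodes that fail an active check.

    compute(node, task) returns the node's result for a task.
    probe(node) returns True if the node still computes correctly
    (assumed here to be a known-answer test; the paper may differ).
    """
    pending = deque(range(len(tasks)))  # indices of tasks not yet committed
    healthy = list(nodes)
    committed = {}                      # task index -> result kept at a checkpoint

    while pending:
        if not healthy:
            raise RuntimeError("no healthy nodes left")

        # Dynamic scheduling: give one pending task to each healthy node.
        since_checkpoint = {node: [] for node in healthy}
        for node in healthy:
            if not pending:
                break
            idx = pending.popleft()
            since_checkpoint[node].append((idx, compute(node, tasks[idx])))

        # Active error detection, followed by fault tolerance.
        for node in list(healthy):
            if probe(node):
                # Node passed the check: commit its results (the checkpoint).
                for idx, result in since_checkpoint[node]:
                    committed[idx] = result
            else:
                # Node failed: discard it and requeue its uncommitted tasks.
                healthy.remove(node)
                for idx, _ in since_checkpoint[node]:
                    pending.append(idx)

    return [committed[i] for i in range(len(tasks))], healthy


# Hypothetical demo: node "n2" silently produces wrong results.
def demo_compute(node, x):
    return x * x + 1 if node == "n2" else x * x


def demo_probe(node):
    return demo_compute(node, 3) == 9


results, survivors = run_with_error_detection(
    [1, 2, 3, 4, 5], ["n0", "n1", "n2"], demo_compute, demo_probe
)
```

In this sketch only the tasks run on the faulty node since its last successful check are recomputed, and the surviving nodes keep their committed results, which is the property the abstract contrasts with a full checkpoint rollback.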


Paper Citation


in Harvard Style

Fang Y., Dong E., Li Y., Liu Q., Qi F. and Du P. (2022). An Error Detection and Tolerance Framework for Task Parallel Applications on High Performance Computing Systems. In Proceedings of the 3rd International Symposium on Automation, Information and Computing - Volume 1: ISAIC; ISBN 978-989-758-622-4, SciTePress, pages 202-208. DOI: 10.5220/0011917800003612


in Bibtex Style

@conference{isaic22,
author={Yanfei Fang and Enming Dong and Yanbing Li and Qi Liu and Fengbin Qi and Peibing Du},
title={An Error Detection and Tolerance Framework for Task Parallel Applications on High Performance Computing Systems},
booktitle={Proceedings of the 3rd International Symposium on Automation, Information and Computing - Volume 1: ISAIC},
year={2022},
pages={202-208},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0011917800003612},
isbn={978-989-758-622-4},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 3rd International Symposium on Automation, Information and Computing - Volume 1: ISAIC
TI - An Error Detection and Tolerance Framework for Task Parallel Applications on High Performance Computing Systems
SN - 978-989-758-622-4
AU - Fang Y.
AU - Dong E.
AU - Li Y.
AU - Liu Q.
AU - Qi F.
AU - Du P.
PY - 2022
SP - 202
EP - 208
DO - 10.5220/0011917800003612
PB - SciTePress