Modeling of checkpointing/rollback strategy towards optimal run time in parallel applications

Authors

  • Samir Jafar
  • Mohammed Mounaf Al-hamad
  • Rahaf Ghazal

Abstract

We present a mathematical model of a checkpointing/rollback strategy that aims to complete the execution of parallel applications on High Performance Computing (HPC) platforms in as little time as possible. This is achieved by minimizing the computation lost to expected failures and the unnecessary overhead of fault-tolerance mechanisms.

In our study, we are interested in a particular type of component failure, the crash fault, in which a component exhibits constant behavior during execution: at any given moment it is either working or failed. We study a coordinated checkpointing strategy for fault tolerance that allows the application to continue despite failures.
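The trade-off described above, between checkpoint overhead and computation lost to failures, can be illustrated with a minimal sketch. It does not reproduce the model of this paper. It assumes Young's classical first-order approximation, in which failures arrive with a known mean time between failures (MTBF) and each coordinated checkpoint has a fixed cost. The function names and the example values are hypothetical.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Checkpoint interval from Young's first-order approximation.

    Assumption (not taken from the paper): T_opt = sqrt(2 * C * MTBF),
    where C is the cost of one checkpoint and MTBF is the platform's
    mean time between failures, both in the same time unit.
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def expected_waste_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order estimate of the fraction of run time that is wasted.

    C / T is the overhead of taking a checkpoint every T time units.
    T / (2 * MTBF) is the expected rework after a crash, assuming a failure
    strikes on average halfway through an interval.
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    # Hypothetical values: 60 s per checkpoint, one failure per day on average.
    cost, mtbf = 60.0, 86400.0
    t_opt = young_interval(cost, mtbf)
    for t in (t_opt / 2, t_opt, t_opt * 2):
        waste = expected_waste_fraction(t, cost, mtbf)
        print(f"interval = {t:8.1f} s, estimated waste = {waste:.4f}")
```

Under these assumptions, checkpointing more often than the computed interval increases overhead, while checkpointing less often increases the expected rework after a crash. Both move the estimated waste away from its minimum.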

Published

2019-07-03

How to Cite

1. Jafar S, Mounaf Al-hamad M, Ghazal R. Modeling of checkpointing/rollback strategy towards optimal run time in parallel applications. TUJ-BA [Internet]. 2019 Jul. 3 [cited 2024 Nov. 24];41(3). Available from: https://journal.tishreen.edu.sy/index.php/bassnc/article/view/8823