SGECluster and resilience

A general question: if one node has a temporary hardware failure, is there a way for SGECluster to start a new node (automatically submit a new job) and send the work there? Thanks.

You can use the Adaptive functionality to achieve that, but keep in mind that your workflow has to be resilient to this kind of failure.

Thanks. I thought adapt() only changed the number of workers based on the number of tasks. I didn’t know it also had a “resilience” feature that resubmits jobs lost to failed hardware. I will definitely try!

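Well, it should if you use something like cluster.adapt(minimum=10, maximum=10). With the minimum equal to the maximum, adaptive mode tries to keep exactly 10 workers alive, so when one is lost it submits a new job to replace it.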
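
Here is a minimal sketch of that setup, in case it helps. The cores, memory and queue values are placeholders I made up, and the two helper functions are hypothetical names for illustration only, so adjust everything to your site:

```python
def fixed_size_adapt_kwargs(n_workers):
    """Return adapt() bounds that pin the cluster at n_workers workers."""
    return {"minimum": n_workers, "maximum": n_workers}


def start_fixed_size_cluster(n_workers=10):
    """Start an SGECluster that adaptive mode keeps at n_workers workers."""
    # Imported here so the snippet can be loaded on a machine without SGE.
    from dask.distributed import Client
    from dask_jobqueue import SGECluster

    # Placeholder resources (assumptions): one single-core worker per job.
    cluster = SGECluster(cores=1, memory="4GB", queue="all.q")

    # minimum == maximum: adaptive mode replaces workers that disappear.
    cluster.adapt(**fixed_size_adapt_kwargs(n_workers))

    client = Client(cluster)
    return cluster, client
```

Calling start_fixed_size_cluster(10) on a submit host should give you a client backed by 10 workers. The scheduler normally reruns tasks that were on a lost worker, which is why they need to be safe to run more than once.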