How to handle job migration of 3rd party tasks?

I have a use case where I am submitting jobs to SLURM in order to run third party, multithreaded software, with one task per job. My use case is very similar to:

This is generally done by passing arguments to the Cluster so that dask thinks each worker has a single thread, while using job keyword arguments to make the scheduler book multiple CPUs. Dask then sends only one task per job, and the called software can use all of the cores.
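A minimal sketch of that kind of configuration, assuming dask-jobqueue's SLURMCluster (the resource numbers below are placeholders, not my actual values):

from dask_jobqueue import SLURMCluster
from distributed import Client

cluster = SLURMCluster(
    cores=1,              # dask sees a single thread, so it schedules one task per worker
    processes=1,          # one worker process per SLURM job
    memory="16GB",        # placeholder
    job_cpu=8,            # but SLURM books 8 CPUs for the multithreaded third party software
    walltime="04:00:00",  # placeholder
)
cluster.scale(jobs=4)     # placeholder number of jobs
client = Client(cluster)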

However, to the main question of this post: since dask is not actually running the computation itself, only keeping track of the inputs, outputs, and completion state, migrating a task to a new worker requires restarting it from scratch. How can this be circumvented? It is suboptimal because if a task starts on a worker that is soon gracefully killed to avoid the SLURM timeout, the current state of the computation is not migrated to the new worker; the compute already spent on it is simply wasted.

Two solutions I can see but do not know how to implement:

  • gracefully stop the worker at the end of every single task from within the worker, e.g. a new SLURM job is submitted for each task
  • migrate the state of the third party software

Any suggestions?
Thanks for any assistance.

EDIT:
I tried to attack strategy 1 using a worker plugin:

import random

import distributed.diagnostics.plugin


class KillerNannyPlugin(distributed.diagnostics.plugin.WorkerPlugin):
    """Kills the worker after a task is completed.

    Transitions to the "memory" or "error" states occur after the "executing" state and
    trigger this plugin. This ensures that each task gets a new Worker.

    This should be a nanny plugin to be more dask friendly, but those don't trigger
    transitions as of 11.08.22.

    Parameters
    ----------
    max_stagger_seconds : float
        Attenuates how long to wait after a task before closing the worker.
        The actual wait time is 1 + max_stagger_seconds * X, where X is drawn from [0, 1],
        which ensures that data is not lost to workers closing at the same time.
    """
    def __init__(self, max_stagger_seconds: float = 10):
        self.max_stagger_seconds = max_stagger_seconds

    def setup(self, worker):
        self.worker = worker

    def transition(self, key, start, finish, *args, **kwargs):
        # Once the task's result has been released from this worker's memory,
        # schedule a graceful shutdown after a randomly staggered delay.
        if start == 'memory' and finish == 'released':
            self.worker.io_loop.call_later(
                1 + random.random() * self.max_stagger_seconds,
                self.worker.close_gracefully,
            )

This successfully causes the worker to close after it completes a task. I had to stagger the shutdowns so that multiple workers would not close at the same time. The side effect is that about 20% of tasks are repeated with this strategy, I assume because a worker was killed before it could send its result back to as_completed, which defeats the original purpose of trying not to waste computation time.

If that is the case, it could be that my start and finish transition states are off. Is there a way to ensure the result has been gathered before closing the worker?
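One direction I have been wondering about is driving the shutdown from the client side instead, retiring a worker only once its result has actually been gathered. An untested sketch (run_third_party and args are placeholders, cluster is configured as above):

from distributed import Client, as_completed

client = Client(cluster)
futures = [client.submit(run_third_party, arg) for arg in args]  # placeholders

for future in as_completed(futures):
    result = future.result()  # the result is now safely on the client
    # find which worker holds the (now gathered) result and retire it gracefully
    holders = client.who_has([future]).get(future.key, [])
    if holders:
        client.retire_workers(workers=list(holders))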

Again I appreciate any help.

Hi @evankomp,

I answered in Restart cluster job on task completion · Issue #597 · dask/dask-jobqueue · GitHub. I think the best strategy would be to improve the --lifetime option handling to take into account the case where we want to wait for the end of a task before stopping a worker, as mentioned in Enhancement Request - Dask Workers lifetime option not waiting for job to finish · Issue #3141 · dask/distributed · GitHub.
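For context, the existing --lifetime mechanism is typically configured like this (a sketch; the durations are placeholders, and older dask-jobqueue versions use the extra keyword instead of worker_extra_args):

from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=1,
    processes=1,
    memory="16GB",
    walltime="01:00:00",
    # ask workers to restart themselves before the SLURM walltime is hit;
    # today this does not wait for a running task to finish, which is what we'd like to improve
    worker_extra_args=["--lifetime", "55m", "--lifetime-stagger", "4m"],
)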

I didn’t have time to look deeply into your proposed Plugin solution, though.