Hello everyone,
I am working on an HPC cluster and have been using the batch runner from dask-jobqueue. I have to use it because SSH tunneling is disabled, supposedly for security reasons. In addition, the admins prefer large jobs spanning one or more full nodes, and will manually downgrade the priority of users who launch many small jobs in shared mode.
I am currently using this script:
#!/bin/bash
#SBATCH --job-name=dask
#SBATCH --constraint=GENOA
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --time=1:00:00
### handle environment
module purge
module load cpe/23.12
module load cray-python
module load conda
module list
conda activate my_env
# Fail on first error or undefined variable
set -eu
set -x
# Check if logs exists and is a directory, create if needed
if [ -e "logs" ] && [ ! -d "logs" ]; then
    echo "Error: 'logs' exists but is not a directory" >&2
    exit 1
elif [ ! -d "logs" ]; then
    mkdir -p logs || {
        echo "Error: Failed to create logs directory" >&2
        exit 1
    }
fi
srun --exclusive --nodes=1 \
    --ntasks-per-node=192 \
    --cpus-per-task=1 \
    --threads-per-core=1 \
    python src/features/my_script.py --execution_mode batch_runner \
    2>&1 | tee "logs/run_${SLURM_JOB_ID}.log"
# Check the Python script's exit status (plain $? would report tee's status)
if [ "${PIPESTATUS[0]}" -ne 0 ]; then
    echo "Error: Python script failed" >&2
    exit 1
fi
I have multiple questions about how the workers are, or can be, defined:
- Intuitively, it would make more sense to define workers with multiple cores and more memory, to avoid the unnecessary communication between a myriad of small one-core workers (the current implementation). Can I define the size of each worker simply through the srun arguments?
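To make the question concrete, what I have in mind is shrinking the srun geometry so that each task (and, if I understand correctly, each resulting worker) gets several cores. The numbers below are illustrative, not a tested configuration:

```shell
# Illustrative variant of my srun line: 24 workers of 8 cores each
# instead of 192 one-core workers, assuming one worker per SLURM task.
srun --exclusive --nodes=1 \
    --ntasks-per-node=24 \
    --cpus-per-task=8 \
    python src/features/my_script.py --execution_mode batch_runner
```

I am unsure whether each worker would then actually see and use its 8 cores, or whether something must also change on the Dask side.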
- I can have up to two threads per core, but I have no idea whether that would actually change anything, since my code currently runs single-threaded and Dask-based multithreading does not seem to be an option for a batch runner. I can run multi-node jobs, but multiple single-node jobs would also be fine. Multithreading would make sense here because all of a worker's memory could then sit on the same node, and it would let me take full advantage of the two logical threads per core. Is there a particular client/batch-runner option to enable it?
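For what it's worth, this is roughly what I would hope to write, based on the runners example in the dask-jobqueue docs. The `worker_options` argument and the `nthreads` key are my guesses at the right knobs; I have not verified the exact names:

```python
from dask.distributed import Client
from dask_jobqueue.slurm import SLURMRunner

# Hypothetical: paired with an srun geometry of 24 tasks x 8 cpus,
# each worker would run 8 threads. worker_options / nthreads are my
# guesses at the API, not something I have confirmed works.
with SLURMRunner(
    scheduler_file="scheduler-{job_id}.json",
    worker_options={"nthreads": 8},
) as runner:
    with Client(runner) as client:
        client.wait_for_workers(runner.n_workers)
        # ... submit work here ...
```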
- Is my best option actually to launch multiple local clusters with srun, if I can partition my jobs into smaller tasks?
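By "multiple local clusters" I mean each single-node job running something like the standard LocalCluster API, sized to its node. A minimal sketch (the worker counts are illustrative, and I use processes=False here only to keep the sketch lightweight):

```python
from dask.distributed import Client, LocalCluster

# One self-contained mini-cluster per single-node job.
# On a real node I would size n_workers/threads to the 192 cores.
cluster = LocalCluster(n_workers=2, threads_per_worker=2, processes=False)
client = Client(cluster)

# Trivial sanity check: run one task on the cluster.
result = client.submit(sum, [1, 2, 3]).result()
print(result)

client.close()
cluster.close()
```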