Not a terribly specific question, but we are hosting an internal platform to run large data jobs (regularly 200+ CPUs), and we often have internal users who don’t want to spend time thinking about the best configuration (e.g., `nthreads` for the nanny config, how many CPUs a single worker should have, how big the scheduler should be, whether the data should be repartitioned, etc.).
Has anyone tried to instrument these kinds of usage guardrails? Any pointers would be very helpful!
Hi @yifanwu, welcome to Dask community!
Well, it’s a subject we’ve been discussing for a long time in the Dask/Pangeo community, especially in Europe, where we often depend on HPC systems more than on Cloud computing platforms.
In the end, we haven’t yet reached our goal of building the kind of guidelines you’re talking about…
Some articles or videos might be interesting:
There are probably other talks or slides on the subject, but I didn’t find them in a few minutes.
If I quickly sum up some thoughts on the subject:
- Threads vs Processes: all the cores of a server should be used: one thread per core. Whether to split them across several processes depends on the kind of tasks you have (you can find guidelines here). By default, Dask uses roughly processes = sqrt(threads), but you might want to test with only one thread per process, or the opposite, to see how it affects your workflow. There is also the question of how much memory each process will end up with.
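To make that rule of thumb concrete, here is a minimal sketch (the function name and the rounding choice are my own) of splitting a node’s cores between processes and threads following the processes ≈ sqrt(threads) heuristic mentioned above:

```python
import math

def split_cores(total_cores: int) -> tuple[int, int]:
    """Split a node's cores into (processes, threads_per_process).

    Follows the rule of thumb above: processes is roughly the square
    root of the total core count, and the remaining cores become
    threads within each process.
    """
    processes = max(1, round(math.sqrt(total_cores)))
    threads = max(1, total_cores // processes)
    return processes, threads

# e.g. a 16-core node -> 4 processes of 4 threads each
```

You would then feed those two numbers to your worker launch command or cluster object, and benchmark against the all-threads and all-processes extremes for your own workload.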
- Chunk/Partition size: this can really affect your workflow, so choose wisely. There is a blog article about that here.
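As a rough illustration of the arithmetic involved (the 128 MiB default target here is my own assumption, not a Dask constant), you can work backwards from a target partition size to a partition count:

```python
def n_partitions(total_bytes: int, target_bytes: int = 128 * 2**20) -> int:
    """How many partitions keep each one at or below target_bytes.

    Ceil division; 128 MiB is used here as an illustrative target,
    but the right value depends on your workload and worker memory.
    """
    return max(1, -(-total_bytes // target_bytes))

# a 10 GiB dataset at ~128 MiB per partition -> 80 partitions
```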
- File format (and related chunking or partitioning): choose a format optimized for your workflow (e.g. Parquet for DataFrames, Zarr for Arrays). And make sure it is partitioned appropriately, aligned with the computation you want to perform on it. You might want to first repartition or rewrite a dataset in another format to speed up your analysis later on.
- Scheduler: it is not currently multi-threaded, so one or two cores should be enough. The memory you’ll need depends on the complexity and size of your task graph, but I’d say it’s rare to need more than 8 or 16 GiB of memory for the Scheduler process. See this Stackoverflow answer for some insights into the Scheduler’s memory behavior.
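As a back-of-the-envelope sketch (the ~1 kB per task figure is a commonly quoted rule of thumb, not a guarantee; real usage depends on graph structure), you can estimate scheduler memory from the size of the task graph:

```python
def scheduler_memory_gib(n_tasks: int, bytes_per_task: int = 1_000) -> float:
    """Very rough scheduler memory estimate, in GiB.

    Assumes ~1 kB of scheduler-side overhead per task, a commonly
    quoted rule of thumb; treat the result as an order of magnitude,
    not a provisioning target.
    """
    return n_tasks * bytes_per_task / 2**30
```

By this estimate, a graph of ten million tasks needs on the order of 10 GiB, which matches the 8–16 GiB ceiling above: if you need more, it is usually a sign the graph itself should be made smaller (larger chunks, fewer tasks).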
Hope that helps!