Hi.
I'm trying to learn the correct way to use dask.config.
I'm running a distributed environment using Kubernetes (k8s) on GCP VMs.
Scenario:
I have an input function, a logic function, and an output function.
input - in charge of reading Parquet files with different settings
logic - the actual transformation that happens
output - in charge of writing Parquet with different configurations
I would like to start using a custom dask.config via environment variables, YAML files, and context managers, and I'm trying to understand how it works behind the scenes.
For example:
In the input function I will have a context manager wrapping the read_parquet call, e.g.:

import dask
import dask.dataframe as dd

def input_fn():
    with dask.config.set({"some.option": "value"}):  # placeholder settings
        return dd.read_parquet(...)

(no compute nor persist)
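For completeness, here is how I understand the other two mechanisms for the same placeholder option (both are read once, when Dask is imported):

import dask

# Environment variable, read at import time:
#   export DASK_SOME__OPTION="value"          -> some.option
# YAML file (e.g. ~/.config/dask/dask.yaml), also read at import time:
#   some:
#     option: value
# Context manager: process-local and scoped to the block:
with dask.config.set({"some.option": "value"}):
    print(dask.config.get("some.option"))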
The ddf is then passed to the logic function, which also has its own with dask.config.set(...) block.
Sometimes it has persist, compute, set_index (the non-lazy kind), etc., and sometimes it's pure lazy logic.
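A sketch of what I mean (the config option and the "id" column are placeholders):

def logic_fn(ddf):
    with dask.config.set({"some.option": "value"}):  # placeholder settings
        ddf = ddf.set_index("id")  # not fully lazy: computes divisions
        return ddf.persist()       # materializes partitions on the workers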
I do understand that the config influences how the graph is built, i.e. which configuration it should use, but what I don't understand is which config it will use and when.
I'm trying to optimize performance depending on the logic in the transformation and squeeze out performance where possible.
For example:
If the logic is quite heavy, I try to spill to disk more often and play with worker saturation.
(As an example, let's say I use something like the snippet below, but it could be any settings here.)
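Something along these lines, for illustration (the exact keys and values vary):

with dask.config.set({
    "distributed.scheduler.worker-saturation": 1.0,  # fewer queued root tasks per worker
    "distributed.worker.memory.spill": 0.6,          # start spilling to disk earlier
}):
    result = heavy_logic(ddf)  # hypothetical transformation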
Unfortunately Dask’s config system is not as granular as this. These kinds of config options are applied when the workers start up and can’t be modified at runtime.
If you’re using dask-kubernetes you can dynamically create Dask clusters with KubeCluster and apply these configs to each cluster.
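For example, a rough sketch with the operator's KubeCluster (the cluster name is hypothetical):

from dask_kubernetes.operator import KubeCluster
from distributed import Client

# One short-lived cluster per stage, so each stage's worker config
# is fixed once, at cluster creation.
with KubeCluster(name="logic-stage") as cluster:
    with Client(cluster) as client:
        ...  # run that stage's work here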
Thanks! To give more context: every function gets its own ephemeral k8s cluster, so essentially I'm applying a new Dask config each time? Or did you mean it only takes effect before worker creation?
What if I use client.restart(), e.g. restart the workers inside the context manager?
The reason I ask is that I do see the new default settings applied on the cluster at creation time (with or without the context manager gives the same results),
and when logging dask.config.config AFTER the cluster is already set up I do see the new settings, so I'm wondering what is going on here? (Same with dask.config.get.)
If you’re setting those config options on the client they won’t have any effect on the workers. You need to set them on the workers themselves.
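You can see the difference by comparing your local config with what the workers report, e.g. (a sketch, assuming client is already connected and using the spill setting from above):

import dask

with dask.config.set({"distributed.worker.memory.spill": 0.6}):
    # Client-local config: reflects the context manager.
    print(dask.config.get("distributed.worker.memory.spill"))
    # Worker-side config: still whatever the workers started with.
    print(client.run(lambda: dask.config.get("distributed.worker.memory.spill")))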
Currently dask-kubernetes does not support config forwarding from the client to the workers. If this is a feature that would be valuable to you I recommend you open an issue on the dask-kubernetes repo. This would be a great feature to add, but we haven’t had anyone ask for it yet so it hasn’t been implemented.
In the meantime you can set the config as environment variables directly on the workers via the env kwarg.
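For example (a sketch; DASK_-prefixed variables are translated to config keys, with "__" becoming a dot, and "_" and "-" treated as equivalent):

from dask_kubernetes.operator import KubeCluster

cluster = KubeCluster(
    name="heavy-transform",  # hypothetical name
    env={
        # becomes distributed.worker.memory.spill on each worker
        "DASK_DISTRIBUTED__WORKER__MEMORY__SPILL": "0.6",
        # becomes distributed.scheduler.worker-saturation
        "DASK_DISTRIBUTED__SCHEDULER__WORKER_SATURATION": "1.0",
    },
)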