Hi.
I'm trying to learn the correct way to use dask.config.
I'm running a distributed environment using Kubernetes (k8s) on GCP VMs.
Scenario:
I have an input function, a logic function, and an output function.
input - in charge of reading Parquet files with different settings
logic - the actual transformation that happens
output - in charge of writing Parquet with different configurations
I would like to start using a custom dask.config via environment variables, YAML files, and context managers, and I'm trying to understand how it works behind the scenes.
For example:
In the input function I will have a context manager wrapping the read_parquet call, e.g.:

import dask
import dask.dataframe as dd

def input_fn():
    with dask.config.set({"some.option": "value"}):  # placeholder settings
        return dd.read_parquet(...)

(no compute nor persist)
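For completeness, here is how I understand the other two mechanisms for the same placeholder option (both are read once, when Dask is imported):

import dask

# Environment variable, read at import time:
#   export DASK_SOME__OPTION="value"          -> some.option
# YAML file (e.g. ~/.config/dask/dask.yaml), also read at import time:
#   some:
#     option: value
# Context manager: process-local and scoped to the block:
with dask.config.set({"some.option": "value"}):
    print(dask.config.get("some.option"))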
The ddf is then passed to the logic function, which also has its own with dask.config.set(...) block.
Sometimes it has persist, compute, set_index (the non-lazy kind), etc., and sometimes it's pure lazy logic.
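A sketch of what I mean (the config option and the "id" column are placeholders):

def logic_fn(ddf):
    with dask.config.set({"some.option": "value"}):  # placeholder settings
        ddf = ddf.set_index("id")  # not fully lazy: computes divisions
        return ddf.persist()       # materializes partitions on the workers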
I do understand that the config influences how the graph is built, i.e. which configuration it should use, but what I don't understand is which config it will use and when.
I'm trying to optimize performance depending on the logic in the transformation and squeeze out performance where possible.
For example:
If the logic is quite heavy, I try to spill to disk more often and play with worker saturation.
(As an example, let's say I use something like the snippet below, but it could be any settings here.)
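Something along these lines, for illustration (the exact keys and values vary):

with dask.config.set({
    "distributed.scheduler.worker-saturation": 1.0,  # fewer queued root tasks per worker
    "distributed.worker.memory.spill": 0.6,          # start spilling to disk earlier
}):
    result = heavy_logic(ddf)  # hypothetical transformation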
Unfortunately Dask’s config system is not as granular as this. These kinds of config options are applied when the workers start up and can’t be modified at runtime.
If you’re using dask-kubernetes you can dynamically create Dask clusters with KubeCluster and apply these configs to each cluster.
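For example, a rough sketch with the operator's KubeCluster (the cluster name is hypothetical):

from dask_kubernetes.operator import KubeCluster
from distributed import Client

# One short-lived cluster per stage, so each stage's worker config
# is fixed once, at cluster creation.
with KubeCluster(name="logic-stage") as cluster:
    with Client(cluster) as client:
        ...  # run that stage's work here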
Thanks! To give more context: every function gets its own ephemeral k8s cluster, so essentially I'm applying a new Dask config each time? Or did you mean it only takes effect before worker creation?
What if I use client.restart(), e.g. restart the workers inside the context manager?
The reason I ask is that I do see the new default settings applied on the cluster at creation time (with or without the context manager gives the same results),
and when logging dask.config.config AFTER the cluster is already set up I do see the new settings, so I'm wondering what is going on here? (Same with dask.config.get.)
If you’re setting those config options on the client they won’t have any effect on the workers. You need to set them on the workers themselves.
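You can see the difference by comparing your local config with what the workers report, e.g. (a sketch, assuming client is already connected and using the spill setting from above):

import dask

with dask.config.set({"distributed.worker.memory.spill": 0.6}):
    # Client-local config: reflects the context manager.
    print(dask.config.get("distributed.worker.memory.spill"))
    # Worker-side config: still whatever the workers started with.
    print(client.run(lambda: dask.config.get("distributed.worker.memory.spill")))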
Currently dask-kubernetes does not support config forwarding from the client to the workers. If this is a feature that would be valuable to you I recommend you open an issue on the dask-kubernetes repo. This would be a great feature to add, but we haven’t had anyone ask for it yet so it hasn’t been implemented.
In the meantime you can set the config as environment variables directly on the workers via the env kwarg.
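For example (a sketch; DASK_-prefixed variables are translated to config keys, with "__" becoming a dot, and "_" and "-" treated as equivalent):

from dask_kubernetes.operator import KubeCluster

cluster = KubeCluster(
    name="heavy-transform",  # hypothetical name
    env={
        # becomes distributed.worker.memory.spill on each worker
        "DASK_DISTRIBUTED__WORKER__MEMORY__SPILL": "0.6",
        # becomes distributed.scheduler.worker-saturation
        "DASK_DISTRIBUTED__SCHEDULER__WORKER_SATURATION": "1.0",
    },
)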