I have some common configuration information that needs to be used in node computation. How can I pass this common information to all nodes?
Hi @abbydev, welcome to the Dask Discourse forum,
I’m not sure that writing “I’m in a bit in a hurry, thank you.” in your title will really help you get an answer.
Moreover, your post is quite short, and without an example it’s hard to understand what you want to achieve. There could be several answers: using scatter, writing a WorkerPlugin, a preload script…
Hi @guillaumeeb Thank you very much for your reply! I just came into contact with dask not long ago, but I am deeply fascinated!
My requirement is: I have some shared configuration that is used in distributed computing. The pseudocode is as follows:
```python
import dask

# Set global configuration options
dask.config.set(some_option='value')

def my_func(partition):
    # Use global configuration information in functions
    option_value = dask.config.get('some_option')
    # Do my thing here
    ...

# Use map_partitions function and pass custom function
result = df.map_partitions(my_func)
```
In Spark, if you need to pass some configuration or data to the distributed computing nodes, you can use broadcast. Is scatter the corresponding method in Dask?
Using dask.config will work if you are using the Threaded scheduler.
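To illustrate the threaded case, here is a minimal sketch. It uses `dask.delayed` instead of a DataFrame to stay self-contained, and `some_option` is just the placeholder name from the question. It should work because all threads live in the same process and therefore see the same `dask.config`.

```python
import dask

# Set the option once in the main process.
dask.config.set({"some_option": "value"})

@dask.delayed
def my_func(x):
    # Tasks run in threads of the same process, so they read the same config.
    return f"{x}:{dask.config.get('some_option')}"

# Force the threaded scheduler explicitly.
results = dask.compute(*[my_func(i) for i in range(3)], scheduler="threads")
```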
If you are using a distributed cluster, then a good solution could be to use Variable.
A quick example:
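```python
from dask.distributed import Client, Variable

client = Client()  # or Client("<scheduler-address>") for an existing cluster

x = Variable('some_option')
x.set("A good option")

def get_variable():
    # Runs on a worker and looks the value up by name
    return Variable('some_option').get()

client.submit(get_variable).result()
```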
You can also use scatter, but then you’ll need to modify your function to take the scattered configuration object as an argument.
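A rough sketch of the scatter approach, assuming a local in-process cluster for demonstration and a hypothetical `config` dict with a `scale` key. Note that the dict is wrapped in a list: as far as I know, passing a dict directly to `scatter` scatters each value separately and returns a dict of futures instead of one future for the whole object.

```python
from dask.distributed import Client

# Assumption: a local in-process cluster, only for demonstration.
client = Client(processes=False, dashboard_address=None)

config = {"scale": 10}  # hypothetical configuration object

# broadcast=True asks Dask to copy the data to every worker.
[config_future] = client.scatter([config], broadcast=True)

def my_func(x, cfg):
    # cfg arrives as the real dict; Dask resolves the future for us.
    return x * cfg["scale"]

futures = [client.submit(my_func, i, config_future) for i in range(3)]
results = client.gather(futures)
client.close()
```

This is the closest equivalent to a Spark broadcast variable: the function receives the configuration explicitly instead of reading global state.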
There are several other options, like using client.run to store some configuration on each Worker, using a WorkerPlugin, etc.
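For the WorkerPlugin route, something like the following might work. The `ConfigPlugin` class, the `app_config` attribute and the `read_option` helper are names made up for this sketch, not part of Dask. Newer versions of distributed expose `Client.register_plugin`, older ones `Client.register_worker_plugin`, so the sketch falls back from one to the other.

```python
from dask.distributed import Client, get_worker
from distributed.diagnostics.plugin import WorkerPlugin

class ConfigPlugin(WorkerPlugin):
    """Hypothetical plugin that attaches a config dict to each worker."""

    def __init__(self, config):
        self.config = config

    def setup(self, worker):
        # Called once on each worker; app_config is an attribute name
        # invented for this example.
        worker.app_config = self.config

def read_option(name):
    # Inside a task, get_worker() returns the worker running it.
    return get_worker().app_config[name]

# Assumption: a local in-process cluster, only for demonstration.
client = Client(processes=False, dashboard_address=None)

# register_plugin in recent releases, register_worker_plugin in older ones.
register = getattr(client, "register_plugin", None) or client.register_worker_plugin
register(ConfigPlugin({"some_option": "value"}))

option = client.submit(read_option, "some_option").result()
client.close()
```

Compared with a one-off `client.run` call, a registered plugin is also applied to workers that join the cluster later.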
Okay, thank you very much for your reply. I will try it in different scenarios.