Memoizing External to Cluster Table Reads

We have a cluster where workers will reference one of 4 tables from a database. There is little point in having each table be read by each worker from the remote database, which could mean 4 x n_workers database reads. I would like either to pre-read these tables and push their Dask DataFrames into client.set_metadata(), or to have them be dynamically read by a worker and registered with client.set_metadata(), a.k.a. memoization. Or is there a different protocol for making data globally readable by workers?

Anon,
Andrew

Looks like client.get_dataset() and its peers will likely solve my problem. I’m not sure how they handle two workers trying to update the same key. More adventures.
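
Roughly, the pattern I have in mind looks like this (a sketch only; load_table() is a stand-in for however we read from BigQuery, and the addresses and table names are placeholders):

```python
from dask.distributed import Client

client = Client("scheduler-address:8786")  # placeholder address

# Read each table once on the client (load_table is a stand-in for our
# BigQuery read) and publish it so anything with a client can look it
# up by name.
for name in ("table_a", "table_b", "table_c", "table_d"):
    client.publish_dataset(load_table(name), name=name)

# Later, from any process holding a client:
table_a = client.get_dataset("table_a")
```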

Anon,
Andrew

Hi @adonoho,

It’s a bit hard to give advice here without knowing more about your data and use case.

It might be better to have all the Workers directly read the tables, or parts of the tables, from the database. Preloading on the client side and then sharing data across workers is generally not recommended, and is only a good idea if the data you share is small enough.

How big is the data you read from the database?

What do you want to do with it, is it only static data?

First, thank you for helping with my problem.

It might be better to have all the Workers directly read the tables, or parts of the tables, from the database. Preloading on the client side and then sharing data across workers is generally not recommended, and is only a good idea if the data you share is small enough.

Each worker executes an embarrassingly parallel function and may read one of several tables stored in Google BigQuery. We expect to have several hundred workers, each with 4GB RAM, perform an analysis on the tables. GBQ has been sensitive to us having multiple nodes writing directly to it. Hence, I am assuming that it will be equally sensitive to many reads. As the parameter search will rotate through the tables, we cannot guarantee that any one worker will always read from the same table. 4 reads from the DB versus 4 * n_workers seems like a worthwhile win.

How big is the data you read from the database?

One of the tables is over 100 MB in size. For our purposes, multiply by 4 for an upper bound.

I was assuming that distributed datasets are like a key value store. Staying within the cluster is much cheaper than leaving the cluster. IOW, I expected the data to migrate off of the Client machine and into the cluster as it is being used.

What do you want to do with it, is it only static data?

Yes, I would prefer that the data remain static throughout the cluster’s lifetime.

Can I depend upon get_dataset() to deliver a copy of the dataset, or do I need to deep copy it when I provide it to my worker code? (Currently, I deep copy before I turn it over to the analyst’s code.)

I am just about to start doing cluster testing on our system. It works fine on a LocalCluster.

Anon,
Andrew

P.S. If you celebrate, Happy Thanksgiving.

Since it’s a Dask collection, it is immutable; I think you should republish it to update it.
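
In other words, something like this (just a sketch; new_df stands for whatever the replacement table would be):

```python
# There is no in-place update of a published dataset; to replace it,
# unpublish and publish again under the same name.
client.unpublish_dataset("table_a")
client.publish_dataset(new_df, name="table_a")
```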

However, with your use case, I would probably just wrap the data in Delayed, or use the scatter method to get a future pointing to it and reuse the data within tasks.
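
A rough sketch of what I mean with scatter (load_table() is just a placeholder for however you read the table, and analyze/some_params stand for your own analysis code):

```python
from dask.distributed import Client

client = Client("scheduler-address:8786")  # placeholder address

# scatter with broadcast=True copies each table out to the workers once;
# the returned futures can then be reused across many tasks.
tables = {name: client.scatter(load_table(name), broadcast=True)
          for name in ("table_a", "table_b", "table_c", "table_d")}

def analyze(table, params):
    # table arrives here as the concrete DataFrame, not a future
    ...

# Futures passed as arguments are resolved to the data on the worker,
# so the table is not re-read from the database for each task.
future = client.submit(analyze, tables["table_a"], some_params)
```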

Thank you for sharing that idea about Delay-ing the data. As I am new to Dask, I will explore further. BTW, get_dataset() is working quite well with a memoized local getter. Yes, this means that each node ends up holding a copy of each data set.
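
For reference, the memoized local getter is essentially this (simplified; the function name is mine and the table name is illustrative):

```python
from functools import lru_cache
from dask.distributed import get_client

@lru_cache(maxsize=None)
def get_table(name):
    # Called from inside a task: get_client() returns the client connected
    # to the scheduler, and the dataset is fetched once per worker process
    # and cached locally thereafter.
    return get_client().get_dataset(name)
```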

Anon,
Andrew

But in any case you won’t be able to avoid that, which is perfectly okay in your workflow!