First, thank you for helping with my problem.
> It might be better to have all the Workers directly read the tables, or parts of the tables, from the database. Preloading on the client side and then sharing data across workers is generally not recommended, and can be good only if the data you share is small enough.
Each worker executes an embarrassingly parallel function and may read one of several tables stored in Google BigQuery. We expect to have several hundred workers, each with 4 GB of RAM, performing an analysis on the tables. GBQ has been sensitive to us having multiple nodes writing directly to it; hence, I am assuming that it will be equally sensitive to many concurrent reads. As the parameter search will rotate through the tables, we cannot guarantee that any one worker will always read from the same table. Four reads from the database versus 4 * n_workers reads seems like a worthwhile win.
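To make that concrete, the pattern I have in mind is roughly the following: read each table once on the client, scatter it to the workers, and publish it under a name so the worker code can look it up later. This is only a sketch: the scheduler address, the table names, and `read_table_from_bigquery` are placeholders, not our real code.

```python
import pandas as pd
from distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address


def read_table_from_bigquery(name: str) -> pd.DataFrame:
    # Placeholder for our existing BigQuery read; a dummy frame keeps the sketch runnable.
    return pd.DataFrame({"table": [name], "value": [0.0]})


for name in ["table_a", "table_b", "table_c", "table_d"]:  # placeholder table names
    df = read_table_from_bigquery(name)          # one read per table, on the client
    future = client.scatter(df, broadcast=True)  # replicate the data onto every worker
    client.publish_dataset(**{name: future})     # register it under a cluster-wide name
```

The point being that BigQuery would be hit four times in total, regardless of how many workers end up in the cluster.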
> How big is the data you read from the database?
One of the tables is over 100 MB in size. For our purposes, multiply by 4 for an upper bound.
I was assuming that distributed datasets work like a key-value store, and that staying within the cluster is much cheaper than leaving the cluster. In other words, I expected the data to migrate off of the `Client` machine and into the cluster as it is being used.
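Concretely, this is what I expected the worker side to look like: a task asks the cluster for the published table by name and, if I understand it correctly, the lookup is served from inside the cluster rather than from the `Client` machine. Again only a sketch; `run_analysis` stands in for the analyst's code, and it assumes the tables were published as in the sketch above.

```python
import pandas as pd
from distributed import worker_client


def run_analysis(df: pd.DataFrame, params: dict) -> float:
    # Stand-in for the analyst's code.
    return float(len(df))


def analysis_task(table_name: str, params: dict) -> float:
    # Runs on a worker: fetch the published table from within the cluster,
    # not from the Client machine.
    with worker_client() as client:
        df = client.get_dataset(table_name).result()
    return run_analysis(df, params)
```

The tasks themselves would then be submitted in the usual way, e.g. `client.submit(analysis_task, "table_a", params)`.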
> What do you want to do with it? Is it only static data?
Yes, I would prefer that the data remain static throughout the cluster’s lifetime.
Can I depend upon `get_dataset()` to deliver a copy of the dataset, or do I need to deep-copy it when I provide it to my worker code? (Currently, I deep-copy it before I turn it over to the analyst's code.)
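For reference, the defensive copy I am doing today looks roughly like this; the question is whether the `copy.deepcopy` is redundant. Placeholder names as before.

```python
import copy

import pandas as pd
from distributed import worker_client


def analysts_code(df: pd.DataFrame) -> float:
    # Stand-in for the analyst's real function.
    return float(df.size)


def guarded_task(table_name: str) -> float:
    with worker_client() as client:
        df = client.get_dataset(table_name).result()
    # Deep copy in case the retrieved object is shared between tasks on the same
    # worker, so the analyst's code cannot mutate the cluster-wide copy.
    return analysts_code(copy.deepcopy(df))
```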
I am just about to start doing cluster testing on our system. It works fine on a `LocalCluster`.
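In case it is relevant, the local test is essentially the same pattern run against a `LocalCluster`; a minimal, self-contained version with dummy data and placeholder sizes looks like this.

```python
import pandas as pd
from distributed import Client, LocalCluster

# Small local stand-in for the real deployment (several hundred workers with 4 GB each).
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="4GB")
client = Client(cluster)

df = pd.DataFrame({"value": range(10)})      # dummy table in place of the BigQuery read
future = client.scatter(df, broadcast=True)  # replicate to every local worker
client.publish_dataset(**{"table_a": future})

# Any task (or the client itself) can now look the table up by name.
print(client.get_dataset("table_a").result().head())
```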
Anon,
Andrew
P.S. If you celebrate, Happy Thanksgiving.