First, thank you for helping with my problem.
> It might be better to have all the Workers directly read the tables, or parts of the tables, from the database. Preloading on the client side and then sharing data across workers is generally not recommended, and can be good only if the data you share is small enough.
Each worker executes an embarrassingly parallel function and may read one of several tables stored in Google BigQuery. We expect to have several hundred workers, each with 4 GB of RAM, performing an analysis on the tables. GBQ has been sensitive to us having multiple nodes writing directly to it; hence, I am assuming that it will be equally sensitive to many concurrent reads. As the parameter search will rotate through the tables, we cannot guarantee that any one worker will always read from the same table. Four reads from the database versus 4 * n_workers reads seems like a worthwhile win.
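To make that concrete, the pattern I have in mind is roughly the following: read each table once on the client, scatter it to the workers, and publish it under a name so the worker code can look it up later. This is only a sketch: the scheduler address, the table names, and `read_table_from_bigquery` are placeholders, not our real code.

```python
import pandas as pd
from distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address


def read_table_from_bigquery(name: str) -> pd.DataFrame:
    # Placeholder for our existing BigQuery read; a dummy frame keeps the sketch runnable.
    return pd.DataFrame({"table": [name], "value": [0.0]})


for name in ["table_a", "table_b", "table_c", "table_d"]:  # placeholder table names
    df = read_table_from_bigquery(name)          # one read per table, on the client
    future = client.scatter(df, broadcast=True)  # replicate the data onto every worker
    client.publish_dataset(**{name: future})     # register it under a cluster-wide name
```

The point being that BigQuery would be hit four times in total, regardless of how many workers end up in the cluster.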
> How big is the data you read from the database?
One of the tables is over 100 MB in size. For our purposes, multiply by 4 for an upper bound.
I was assuming that distributed datasets work like a key-value store, and that staying within the cluster is much cheaper than leaving the cluster. In other words, I expected the data to migrate off of the `Client` machine and into the cluster as it is being used.
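Concretely, this is what I expected the worker side to look like: a task asks the cluster for the published table by name and, if I understand it correctly, the lookup is served from inside the cluster rather than from the `Client` machine. Again only a sketch; `run_analysis` stands in for the analyst's code, and it assumes the tables were published as in the sketch above.

```python
import pandas as pd
from distributed import worker_client


def run_analysis(df: pd.DataFrame, params: dict) -> float:
    # Stand-in for the analyst's code.
    return float(len(df))


def analysis_task(table_name: str, params: dict) -> float:
    # Runs on a worker: fetch the published table from within the cluster,
    # not from the Client machine.
    with worker_client() as client:
        df = client.get_dataset(table_name).result()
    return run_analysis(df, params)
```

The tasks themselves would then be submitted in the usual way, e.g. `client.submit(analysis_task, "table_a", params)`.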
> What do you want to do with it? Is it only static data?
Yes, I would prefer that the data remain static throughout the cluster’s lifetime.
Can I depend upon `get_dataset()` to deliver a copy of the dataset, or do I need to deep-copy it when I provide it to my worker code? (Currently, I deep-copy it before I turn it over to the analyst's code.)
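For reference, the defensive copy I am doing today looks roughly like this; the question is whether the `copy.deepcopy` is redundant. Placeholder names as before.

```python
import copy

import pandas as pd
from distributed import worker_client


def analysts_code(df: pd.DataFrame) -> float:
    # Stand-in for the analyst's real function.
    return float(df.size)


def guarded_task(table_name: str) -> float:
    with worker_client() as client:
        df = client.get_dataset(table_name).result()
    # Deep copy in case the retrieved object is shared between tasks on the same
    # worker, so the analyst's code cannot mutate the cluster-wide copy.
    return analysts_code(copy.deepcopy(df))
```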
I am just about to start doing cluster testing on our system. It works fine on a `LocalCluster`.
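In case it is relevant, the local test is essentially the same pattern run against a `LocalCluster`; a minimal, self-contained version with dummy data and placeholder sizes looks like this.

```python
import pandas as pd
from distributed import Client, LocalCluster

# Small local stand-in for the real deployment (several hundred workers with 4 GB each).
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="4GB")
client = Client(cluster)

df = pd.DataFrame({"value": range(10)})      # dummy table in place of the BigQuery read
future = client.scatter(df, broadcast=True)  # replicate to every local worker
client.publish_dataset(**{"table_a": future})

# Any task (or the client itself) can now look the table up by name.
print(client.get_dataset("table_a").result().head())
```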
Anon,
Andrew
P.S. If you celebrate, Happy Thanksgiving.