In the code
df = <input func>
df.groupby(field1).field2.mean().compute()
dask fails to recognize that the output requires only the two mentioned fields, so all columns are loaded, even when the IO layer supports loading only selected columns.
The code at Dataframe column optimization · GitHub shows a primitive, naive way to determine which columns are required to perform the operation. It works for the given snippet, but it is very slow, and it probably fails for more complex cases.
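For illustration only, here is a minimal sketch of one such brute-force approach (hypothetical, not the gist's actual code): drop each column in turn, re-run the operation on a small pandas sample, and keep any column whose removal breaks or changes the result. It needs one full evaluation per column, which is why this style of detection is slow, and it can misjudge column interactions in more complex pipelines.

import pandas as pd

def required_columns(op, df):
    # Naive detection: a column counts as "required" if dropping it breaks
    # the operation or changes its result. Costs one evaluation per column.
    baseline = op(df)
    needed = set()
    for col in df.columns:
        try:
            same = op(df.drop(columns=[col])).equals(baseline)
        except Exception:
            same = False
        if not same:
            needed.add(col)
    return needed

sample = pd.DataFrame({"field1": [1, 1, 2], "field2": [1.0, 2.0, 3.0], "extra": list("abc")})
required_columns(lambda d: d.groupby("field1").field2.mean(), sample)
# -> {'field1', 'field2'}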
Thoughts?
Hi @martindurant!
Clearly, you’ve got a lot more background about the possible optimizations here than me.
In my mind, it's not so bad that Dask doesn't know how to optimize this, since it can be done on the user side if needed. I agree it's always better to have automated optimizations, but there is always a tradeoff against the cost on the Dask code side.
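For example, assuming the input is Parquet read with dd.read_parquet (the reader and path in the original snippet are placeholders), the user-side fix is simply to tell the reader which columns to load:

import dask.dataframe as dd

# Pass the needed columns explicitly so the IO layer never reads the rest.
# "data.parquet" and the field names stand in for the original snippet.
df = dd.read_parquet("data.parquet", columns=["field1", "field2"])
df.groupby("field1").field2.mean().compute()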
Does Pandas have optimizations for this?
cc @rjzamora @mrocklin.
Since pandas is eager and in-memory, there is no chance for it to do this kind of optimization.
To be sure, dask already has column pushdown, which costs code complexity and CPU time at optimization, but it only works for the very simplest case: a load followed directly by a column selection. In that case it makes sense to tell users to select columns explicitly, since they are doing so anyway; but in more complex situations, users have a much harder time back-propagating column usage through the graph by hand.
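Concretely (again assuming a Parquet source, for illustration), the case the existing pushdown handles is a selection immediately after the load, which the optimizer can rewrite so that only the selected columns are read:

import dask.dataframe as dd

# Load followed directly by a column selection: the existing column
# pushdown can rewrite this so only field1 and field2 are read from disk.
df = dd.read_parquet("data.parquet")[["field1", "field2"]]
df.groupby("field1").field2.mean().compute()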
I should point out that any SQL system (including the likes of Spark) does this routinely and has for decades.