In the code
df = <input func>
df.groupby(field1).field2.mean().compute()
dask fails to recognize that the output requires only the two mentioned fields, so all columns are loaded, even when the IO layer supports loading only selected columns.
The code at Dataframe column optimization · GitHub shows a primitive, naive way to determine which columns are required to perform the operation. It works for the given snippet, but it is very slow, and it probably fails for more complex cases.
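For illustration only, here is a minimal sketch of one such brute-force approach (hypothetical, not the gist's actual code): drop each column in turn, re-run the operation on a small pandas sample, and keep any column whose removal breaks or changes the result. It needs one full evaluation per column, which is why this style of detection is slow, and it can misjudge column interactions in more complex pipelines.

import pandas as pd

def required_columns(op, df):
    # Naive detection: a column counts as "required" if dropping it breaks
    # the operation or changes its result. Costs one evaluation per column.
    baseline = op(df)
    needed = set()
    for col in df.columns:
        try:
            same = op(df.drop(columns=[col])).equals(baseline)
        except Exception:
            same = False
        if not same:
            needed.add(col)
    return needed

sample = pd.DataFrame({"field1": [1, 1, 2], "field2": [1.0, 2.0, 3.0], "extra": list("abc")})
required_columns(lambda d: d.groupby("field1").field2.mean(), sample)
# -> {'field1', 'field2'}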
Thoughts?
Hi @martindurant!
Clearly, you’ve got a lot more background about the possible optimizations here than me.
In my mind, it's not so bad that Dask doesn't know how to optimize this, since it can be done on the user side if needed. I agree it's always better to have automated optimizations, but there is always a tradeoff against the cost on the Dask code side.
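For example, assuming the input is Parquet read with dd.read_parquet (the reader and path in the original snippet are placeholders), the user-side fix is simply to tell the reader which columns to load:

import dask.dataframe as dd

# Pass the needed columns explicitly so the IO layer never reads the rest.
# "data.parquet" and the field names stand in for the original snippet.
df = dd.read_parquet("data.parquet", columns=["field1", "field2"])
df.groupby("field1").field2.mean().compute()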
Does Pandas have optimizations for this?
cc @rjzamora @mrocklin.
Since pandas is eager and in-memory, there is no chance for it to do this kind of optimization.
To be sure, dask already has column pushdown, which costs code complexity and CPU time at optimization, but it only works for the very simplest case: a load followed directly by a column selection. In that case it makes sense to tell users to select columns explicitly, since they are doing so anyway; but in more complex situations, users have a much harder time back-propagating column usage through the graph by hand.
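Concretely (again assuming a Parquet source, for illustration), the case the existing pushdown handles is a selection immediately after the load, which the optimizer can rewrite so that only the selected columns are read:

import dask.dataframe as dd

# Load followed directly by a column selection: the existing column
# pushdown can rewrite this so only field1 and field2 are read from disk.
df = dd.read_parquet("data.parquet")[["field1", "field2"]]
df.groupby("field1").field2.mean().compute()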
I should point out that any SQL system (including the likes of Spark) does this routinely and has for decades.