I’d like to know how column projection optimization works with high-level graphs (HLG).
We’ve done a simple test:
import dask.dataframe as dd
import numpy as np
import pandas as pd

# Write a small Parquet file with three columns
pandas_df = pd.DataFrame(np.random.random((1000, 3)), columns=['a', 'b', 'c'])
pandas_df.to_parquet('test.parquet')

# Read it back with Dask and select a single column
x = dd.read_parquet('test.parquet')
x_cols = x.a
x_cols.__dask_graph__()
We’d expect to see only one read-parquet layer with the column ['a'] in the HLG. But we actually get two layers: a read-parquet layer with columns ['a', 'b', 'c'] and a getitem layer. Are we misunderstanding something about column projection?
Hi Milton,
When you create the new collection (with x.a), the task graph associated with x_cols stays “complete” (the full read-parquet layer plus a getitem layer) until the column projection optimization is actually triggered, e.g. by compute, which optimizes the collection’s graph by default. If you call x_cols.visualize(optimize_graph=True) you’ll see a resulting graph with a single layer, and that layer should be optimized to read only column “a”. Side note: the default value for the optimize_graph argument in the visualize method on Dask collections is False.
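For example, continuing from your snippet, you can inspect the HLG layers before optimization and then ask visualize to optimize the graph before drawing it (a minimal sketch; the exact layer names vary between Dask versions, and visualize needs graphviz installed to render the image):

# Unoptimized graph: two HLG layers (read-parquet + getitem)
print(list(x_cols.__dask_graph__().layers))

# Optimize the graph before drawing it; the result should be a single
# read-parquet layer that only reads column 'a'
x_cols.visualize('optimized.png', optimize_graph=True)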
A bit more detail: when you call coll.compute(optimize_graph=True), the __dask_optimize__ function attached to coll is called on coll’s graph and keys. You can read quite a bit more here: Custom Collections — Dask documentation
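If you want to see that step in isolation, you can call the optimization function by hand (a sketch of roughly what compute does before scheduling, following the custom-collections protocol; the exact optimizations applied depend on your Dask version and configuration):

# Grab the unoptimized graph and the output keys of the collection
graph = x_cols.__dask_graph__()
keys = x_cols.__dask_keys__()

# Roughly what compute(optimize_graph=True) runs before scheduling;
# column projection happens as part of this optimization pass
optimized_graph = x_cols.__dask_optimize__(graph, keys)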