How does dataframe column projection optimization work?

Hey everyone,

I’d like to know how column projection optimization in high layer graphs works.

We’ve done a simple test:

import dask.dataframe as dd
import numpy as np
import pandas as pd

pandas_df = pd.DataFrame(np.random.random((1000, 3)), columns=['a','b','c'])
pandas_df.to_parquet('test.parquet')

x = dd.read_parquet('test.parquet')
x_cols = x.a
x_cols.__dask_graph__()

We’d expect to see only one read-parquet layer with the column [‘a’] in the HLG. But we actually get two layers a read-parquet with columns [‘a’,‘b’,‘c’] and a getitem. Are we misunderstanding something about the column projection?

Thanks

Milton

Hi Milton,
When you create the new task (with x.a) the new task graph associated with x_cols will be “complete” until the column projection optimization is actually triggered (i.e. by compute, which defaults to optimizing the collection’s graph). If you call x_col.visualize(optimize_graph=True) you’ll see a resulting graph with a single layer (and that layer should be optimized to select only column “a”). Side node: the default value for the optimize_graph argument in the visualize method on Dask collections is False.

A bit more detail: when you call coll.compute(optimize_graph=True), the __dask_optimize__ attribute that belongs to coll will be called on coll’s graph. You can read quite a bit more here: Custom Collections — Dask documentation

cheers,
Doug

2 Likes

That’s great Doug. Thanks

1 Like