I’d like to know how column projection optimization works with high-level graphs (HLG).
We’ve done a simple test:
import dask.dataframe as dd
import numpy as np
import pandas as pd

# Write a small Parquet file with three columns
pandas_df = pd.DataFrame(np.random.random((1000, 3)), columns=['a', 'b', 'c'])
pandas_df.to_parquet('test.parquet')

# Read it back with Dask and select a single column
x = dd.read_parquet('test.parquet')
x_cols = x.a
x_cols.__dask_graph__()
We’d expect to see only one read-parquet layer with the column ['a'] in the HLG. But we actually get two layers: a read-parquet layer with columns ['a', 'b', 'c'] and a getitem layer. Are we misunderstanding something about column projection?
Hi Milton,
When you create the new collection (with x.a), the task graph associated with x_cols stays “complete” (the full read-parquet layer plus a getitem layer) until the column projection optimization is actually triggered, e.g. by compute, which optimizes the collection’s graph by default. If you call x_cols.visualize(optimize_graph=True) you’ll see a resulting graph with a single layer, and that layer should be optimized to read only column “a”. Side note: the default value for the optimize_graph argument in the visualize method on Dask collections is False.
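For example, continuing from your snippet, you can inspect the HLG layers before optimization and then ask visualize to optimize the graph before drawing it (a minimal sketch; the exact layer names vary between Dask versions, and visualize needs graphviz installed to render the image):

# Unoptimized graph: two HLG layers (read-parquet + getitem)
print(list(x_cols.__dask_graph__().layers))

# Optimize the graph before drawing it; the result should be a single
# read-parquet layer that only reads column 'a'
x_cols.visualize('optimized.png', optimize_graph=True)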
A bit more detail: when you call coll.compute(optimize_graph=True), the __dask_optimize__ function attached to coll is called on coll’s graph and keys. You can read quite a bit more here: Custom Collections — Dask documentation
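If you want to see that step in isolation, you can call the optimization function by hand (a sketch of roughly what compute does before scheduling, following the custom-collections protocol; the exact optimizations applied depend on your Dask version and configuration):

# Grab the unoptimized graph and the output keys of the collection
graph = x_cols.__dask_graph__()
keys = x_cols.__dask_keys__()

# Roughly what compute(optimize_graph=True) runs before scheduling;
# column projection happens as part of this optimization pass
optimized_graph = x_cols.__dask_optimize__(graph, keys)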