Task Stream Understanding "Assign" time

I’m quite new to dask and trying to understand the task stream. I’m seeing all these “assign” blocks, each taking approx. 100s on average. Does that more or less mean each row is taking 100 seconds to create?

Is there a way to tell how long each of my apply funcs is taking?

@winddude Welcome!

Could you please share a minimal version of your workflow? I’ll be able to share more details then.

In broad terms though, looking at your task stream, I can just infer that the “assign” task is taking ~127s to execute on that thread.

Is there a way to tell how long each of my apply funcs is taking?

Besides the task stream, I think the “profile” plot might help?
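Keep in mind that each block in the task stream is one task, and a dataframe task covers a whole partition rather than a single row. For a rough per-row, per-function number outside the dashboard, one option is to time each function on a small local sample before running the full job. The sketch below uses only the standard library; `cheap_step`, `expensive_step`, and `sample_rows` are hypothetical stand-ins, not the functions from this thread.

```python
import time


def time_per_row(func, rows):
    """Return the average wall-clock seconds that func takes per row."""
    start = time.perf_counter()
    for row in rows:
        func(row)
    elapsed = time.perf_counter() - start
    return elapsed / len(rows)


# Hypothetical stand-ins for the real per-row functions.
def cheap_step(row):
    return len(row["content"])


def expensive_step(row):
    return sorted(row["content"] * 200)


# A small sample of rows, here plain dicts with made-up content.
sample_rows = [{"content": "some page text " * 50} for _ in range(20)]

timings = {
    func.__name__: time_per_row(func, sample_rows)
    for func in (cheap_step, expensive_step)
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} s per row")
```

Multiplying the per-row figure by the number of rows in a partition gives an estimate to compare against the block lengths in the task stream.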

Below is the flow I’m using for the dask dataframe:

import dask.dataframe as dd

# iterate_warc_records, get_schema_digest, build_goose_extract and extract_keywords are defined elsewhere
warc_dict = {i['warc_info.warc_id']:i for i in iterate_warc_records()}
ddf = dd.DataFrame.from_dict(warc_dict, orient='index', npartitions=100)
ddf['schema_digest'] = ddf.apply(get_schema_digest, axis=1, meta=('schema_digest', object))
ddf[['lang', 'title', 'authors', 'content', 'date_published']] = ddf.apply(build_goose_extract, axis=1)
ddf = ddf[ddf.lang=='en']  # take only 'english' from goose
ddf = ddf[ddf.content.str.len() > 0]  # drop where body is None
ddf['keywords'] = ddf.apply(extract_keywords, axis=1, meta=('keywords', object))
ddf.compute()

@winddude Thanks for the code!

I think you can use the “groups” plot to get the total time spent on each task:

import dask.dataframe as dd
from dask.distributed import Client

client = Client()

ddf = dd.DataFrame.from_dict(
    {"x": range(1_000_000), "y": range(1_000_000)}, npartitions=4
)

def func(x):
    return x

res = ddf.apply(func, axis=1).persist()

(Note that you need to use persist() here, because this plot is cleared after compute().)

Also, I see you have multiple apply statements. There’s currently no way to distinguish them, but as a workaround you can rewrite your code to use low-level collections. I wouldn’t recommend rewriting unless it’s an absolute deal-breaker, because the current dashboard plots can still give you a lot of useful information. 🙂
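For reference, here is a minimal sketch of what the low-level route could look like with dask.delayed, assuming the work can be expressed as functions over pandas partitions. Delayed tasks take their key names from the wrapped function, so each step should appear under its own name. `add_digest`, `add_keywords`, and the tiny `partitions` list are hypothetical stand-ins, and the synchronous scheduler is used only to keep the example self-contained.

```python
import dask
import pandas as pd


# Hypothetical stand-ins for the real per-partition steps.
def add_digest(df):
    return df.assign(schema_digest=df["content"].str.len())


def add_keywords(df):
    return df.assign(keywords=df["content"].str.split())


# Each pandas DataFrame plays the role of one partition.
partitions = [
    pd.DataFrame({"content": ["a b c", "d e"]}),
    pd.DataFrame({"content": ["f g", "h"]}),
]

results = []
for part in partitions:
    # Each delayed call becomes its own task, keyed by the function name.
    step1 = dask.delayed(add_digest)(part)
    step2 = dask.delayed(add_keywords)(step1)
    results.append(step2)

# Run locally here; with a distributed Client the same graph runs on the cluster.
frames = dask.compute(*results, scheduler="synchronous")
out = pd.concat(frames, ignore_index=True)
print(out)
```

The trade-off is that you manage partitions and concatenation yourself, which is why this is only worth it when telling the steps apart really matters.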
