How to make set_index() use a proper scheduler/client?

When calling set_index() on a dataframe without specifying the divisions, Dask triggers a computation right away. This is fine. But it’s using a “threaded” scheduler instead of the proper client we created with Client(address=...). I think the reason is that we set set_as_default to False when creating the proper client. I do have the client object at hand. How can I tell set_index() to use it?

As for not using set_as_default: that decision was made earlier by my team (I’ll follow up with them on the why). But with it set to False, there are surprises when you call functions like df.compute() or, here, df.set_index(). For the former we can still specify df.compute(scheduler=our_client). Do you generally recommend that we should leave set_as_default to True anyways?

@ubw218 I believe you can do something like:

with dask.config.set(scheduler=your_client):
    df.set_index('x').compute()

Ref docs: Scheduling — Dask documentation

Do you generally recommend that we should leave set_as_default to True anyways?

Yes, we are recommending the distributed scheduler for local computations as well, so it’s best to use this as default. :smile:
