How to tell dask about timezone info in `dd.to_datetime`?

I don’t think this is a bug, but I’m a little stumped about how to tell Dask the datatype in dd.to_datetime.

Obviously the datatype is “datetime64”, but some methods throw errors on “datetime64[ns, UTC]” that don’t on “datetime64[ns]” (like tz_localize), and vice-versa (like tz_convert), so not having run-time information about the timezone is currently causing a bug for me.
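To make the tz_localize / tz_convert asymmetry concrete, here's a small pandas sketch (pandas only, no dask needed to see it):

```python
import pandas as pd

naive = pd.Series(pd.to_datetime(["2023-04-01 22:12:11"]))        # datetime64[ns]
aware = pd.Series(pd.to_datetime(["2023-04-01 22:12:11+00:00"]))  # datetime64[ns, UTC]

# tz_localize only works on tz-naive timestamps...
naive.dt.tz_localize("UTC")           # ok
# aware.dt.tz_localize("UTC")         # raises TypeError: already tz-aware

# ...while tz_convert only works on tz-aware ones
aware.dt.tz_convert("Europe/London")  # ok
# naive.dt.tz_convert("UTC")          # raises TypeError: cannot convert tz-naive
```

So if dask's meta says the column is naive when the real data is tz-aware (or the other way round), one of these calls will pass at graph-build time and fail at compute time.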

The tricky bit is that the date is initially being converted from a string, like so:

df = dd.from_dict({"a": ["2023-04-01 22:12:11.932417+00:00"]}, npartitions=1)
df["a"] = dd.to_datetime(df, format="ISO8601")

The string contains timezone info, but dask doesn’t know this until it computes things (weirdly, this is only the case with the dask-expr backend, not the old-school eager execution).

For instance, I get different answers if I do this:

print(df["a"].dtype)  # dtype('<M8[ns]')
print(df["a"].compute().dtype)  # datetime64[ns, UTC]

I can see why this happens: dask hasn’t looked at the strings, so it has no way of knowing that the whole column is UTC. How can I tell it? I was thinking of something along the lines of:

df["a"] = dd.to_datetime(df, format="ISO8601", meta=("a", "datetime64[ns, UTC]"))

Hi @benrutter,

Do you have a reproducer? I tried your code, but I’m getting an error:

File /work/scratch/env/eynardbg/.conda/envs/python_pangeo_lis/lib/python3.12/site-packages/pandas/core/tools/datetimes.py:1186, in _assemble_from_unit_mappings(arg, errors, utc)
   1184 if len(req):
   1185     _required = ",".join(req)
-> 1186     raise ValueError(
   1187         "to assemble mappings requires at least that "
   1188         f"[year, month, day] be specified: [{_required}] is missing"
   1189     )
   1191 # keys we don't recognize
   1192 excess = sorted(set(unit_rev.keys()) - set(_unit_map.values()))

ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing

Yes, using meta should be the solution.

Yeah, meta is correct. There is currently a bug in how we handle it for to_datetime. Fix is here: Fix meta calculation for to_datetime by phofl · Pull Request #1153 · dask/dask-expr · GitHub


Thanks both! And thanks for the speedy fix @Patrick - if I’m reading right, though, that PR fixes interpreting the meta information, but I’m still unsure how to actually specify it manually! :sweat_smile:

@guillaumeeb my reproducer in the initial comment should work bar a silly error I made; here’s what the original code block should have been:

df = dd.from_dict({"a": ["2023-04-01 22:12:11.932417+00:00"]}, npartitions=1)
df["a"] = dd.to_datetime(df["a"], format="ISO8601")

(I mistyped calling to_datetime on df initially rather than df["a"])

Any idea how to pass in meta?

Doing this:

df["a"] = dd.to_datetime(df["a"], format="ISO8601", meta=("a", "datetime64[ns, UTC]"))

Seems like it should work, but the datatype of df["a"] is still dtype('<M8[ns]').

I seem to remember that this kind of hack used to work:

df._meta = df._meta.assign(a=pd.Series([], dtype="datetime64[ns, UTC]"))

If it did work previously, it doesn’t now: I get a “can’t set attribute _meta” error. I’m guessing this is extra strictness added to stop people unwittingly carrying out hijinks with the optimiser?

Setting meta in to_datetime
Will work after you update to my PR; the information is currently ignored, unfortunately, but you are already specifying it correctly.
