Using dask.dataframe's to_datetime on a pandas dataframe

I hit an error in some code a while back where I was accidentally passing a pandas dataframe into dask’s “to_datetime” function. I’d have expected it to either run as normal or throw an error, but the output seemed to be an assortment of duplicated rows.

I realise this isn’t an error on dask’s part at all, but in my bad implementation (and the fix is simple enough). Since the output seems counterintuitive to what I’d expect, though, I’m curious why this happens. Does anyone know what’s going on under the hood to produce this output?

Hi @benrutter, welcome to Dask Discourse Forum!

Do you have a reproducer? I just tried with a small example, and it’s working as I would expect:

import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({'year': [2015, 2016],
                   'month': [2, 3],
                   'day': [4, 5]})
dd.to_datetime(df).compute()

Result:

0   2015-02-04
1   2016-03-05
dtype: datetime64[ns]

Thanks @guillaumeeb!

I’ve tried recreating it, but actually can’t! Apologies - maybe it was caused by something else somewhere.

If I’m able to recreate it, I’ll share it here.
