Dask created a datetimeindex and I cannot assign it back to the source ddf

dennisd · February 28, 2022, 2:08am

Following a book's guide to converting an object column to datetime, I was able to do the conversion with the pandas to_datetime approach, though it took 1 hour. Dropping the column also worked, but assigning the result back to the original ddf does not. I get a ValueError saying the length of the datetime values (12000000) does not match the length of the index (2). When I used the lambda apply approach instead, it was fast, but I get a different error saying the first argument of strptime should be a str and not a float.

Let me know if I need to provide more details or the code to help. Thank you

pavithraes · February 28, 2022, 2:01pm

@dennisd Welcome to Discourse and thanks for your question!

Could you please share a minimal example and the documentation you’re referring to? It’ll allow us to help you better.

dennisd · February 28, 2022, 3:08pm

Hi,

I have three datetime columns that were initially loaded into the ddf as strings, so I have to convert them to datetime. I used the following guide from the book “Data Science with Python and Dask”, modified for my date format:

from datetime import datetime
date3_parsed = ddf['date3'].apply(lambda x: datetime.strptime(x, "%d%b%Y"), meta=datetime)
date3_a = ddf.drop('date3', axis=1)
date3_b = date3_a.assign(date3=date3_parsed)

The first two date columns worked as expected, but the third date column is giving me a hard time. When I used the code above, it gave me a TypeError:

TypeError: strptime() argument 1 must be str, not float

When I tried the following:

ddf['date3'] = pd.to_datetime(ddf['date3'], format = "%d%b%Y")

the conversion took an hour, then I get the following error:

ValueError: Length of values (13090962) does not match length of index (2)

I broke the steps apart:

1. Parse the column using pd.to_datetime.
2. Drop the original column.
3. Assign the parsed column back.

The parsing worked. I got a DatetimeIndex of this form:

DatetimeIndex(['2021-12-31', '2021-11-30', 'NaT', '2022-03-20'])

However, when I call .head() on it, I get an error saying that DatetimeIndex does not have head.

Dropping the column also worked.

The issue is with the last step, which is where I get the ValueError above.

pavithraes · March 1, 2022, 1:34pm

@dennisd Thanks for the details!

dennisd:

When I tried the following:
ddf['date3'] = pd.to_datetime(ddf['date3'], format = "%d%b%Y")
the conversion took an hour, then I get the following error:
ValueError: Length of values (13090962) does not match length of index (2)

I was able to reproduce this, and it looks like it’s because you’re calling pandas’ to_datetime, which returns an in-memory pandas object rather than a lazy Dask collection, and assigning the result to a Dask DataFrame. You’ll need to use Dask DataFrame’s API here:

from datetime import datetime

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'date3': ['1232021', '1332021', '1432021', '1532021', None]})
ddf = dd.from_pandas(df, npartitions=2)

ddf['date3'] = dd.to_datetime(ddf['date3'], format="%d%m%Y")
ddf.compute()

I believe you wouldn’t need your step-wise workaround after this. That said, just to clarify, you’re getting the TypeError because you may have floats/NaNs in your DataFrame, and datetime.strptime only accepts strings. So, you may need to clean your dataset before converting it to datetime.
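To illustrate the cleaning step, here is a minimal sketch (not from the book or the Dask docs) of a parser that skips non-string values. The helper name parse_or_none is hypothetical, and returning None for missing values is an assumption; in a DataFrame you would likely prefer pd.NaT.

```python
from datetime import datetime
from typing import Optional


def parse_or_none(value: object, fmt: str = "%d%b%Y") -> Optional[datetime]:
    """Parse a date string; return None for non-string values such as float NaN.

    Hypothetical helper for illustration; None stands in for a missing date.
    """
    if not isinstance(value, str):
        # datetime.strptime raises TypeError when given a float such as NaN
        return None
    return datetime.strptime(value, fmt)


# Sample values are made up; the NaN entry mimics a missing date in the column
parsed = [parse_or_none(v) for v in ["31Dec2021", float("nan"), "20Mar2022"]]
```

With Dask, a function like this could be passed to apply in place of the bare datetime.strptime lambda.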

Let me know if this helps!

dennisd · March 7, 2022, 4:37pm

Thanks, this helps me understand. But what if I have NaN values in that date column and it is important to keep them, because I am only doing a data quality check? The original column is a str datatype. Thanks

'date3': ['2021-12-15', '2020-06-08', NaN, NaN, '2025-01-22', '2026-03-18']

pavithraes · March 8, 2022, 10:56am

@dennisd I’m glad it helped!

I’m not sure I completely understand your question. It’d be great if you could elaborate a little and share some more context around the specific computation you’re referring to.

So, dd.to_datetime (which is the recommended way to do datetime conversions in Dask DataFrame) should be able to handle NaN values. It converts them to NaT types, so they’ll be preserved.

Or do you mean you’d like to keep NaN values in your step-wise approach? That is a limitation of datetime.strptime, which only accepts strings. But you can use pd.to_datetime() in your lambda function instead (Dask’s apply runs your function on the values of the underlying pandas Series in each partition), which can handle NaN values:

date3_parsed = ddf['date3'].apply(lambda x: pd.to_datetime(x, format="%d%b%Y"), meta=('date3', 'datetime64[ns]'))

Does this help answer your question?
