Dask_cudf/dask read_parquet failed with NotImplementedError: large_string

I am a new user of dask/dask_cudf.
I have Parquet files of various sizes (11 GB, 2.5 GB, 1.1 GB), all of which fail with NotImplementedError: large_string. My dask.dataframe backend is cudf. With the pandas backend, read_parquet works fine.

Here’s an excerpt of what my data looks like in CSV format:

AADR,17-Oct-2017 09:00,57.47,58.3844,57.3645,58.3844,2094
AADR,17-Oct-2017 10:00,57.27,57.2856,57.25,57.27,627
AADR,17-Oct-2017 11:00,56.99,56.99,56.99,56.99,100
AADR,17-Oct-2017 12:00,56.98,57.05,56.98,57.05,200
AADR,17-Oct-2017 13:00,57.14,57.16,57.14,57.16,700
AADR,17-Oct-2017 14:00,57.13,57.13,57.13,57.13,100
AADR,17-Oct-2017 15:00,57.07,57.07,57.07,57.07,200
AAMC,17-Oct-2017 09:00,87,87,87,87,100
AAU,17-Oct-2017 09:00,1.1,1.13,1.0832,1.121,67790
AAU,17-Oct-2017 10:00,1.12,1.12,1.12,1.12,100
AAU,17-Oct-2017 11:00,1.125,1.125,1.125,1.125,200
AAU,17-Oct-2017 12:00,1.1332,1.15,1.1332,1.15,27439
AAU,17-Oct-2017 13:00,1.15,1.15,1.13,1.13,8200
AAU,17-Oct-2017 14:00,1.1467,1.1467,1.14,1.1467,1750
AAU,17-Oct-2017 15:00,1.1401,1.1493,1.1401,1.1493,4100
AAU,17-Oct-2017 16:00,1.13,1.13,1.13,1.13,100
ABE,17-Oct-2017 09:00,14.64,14.64,14.64,14.64,200
ABE,17-Oct-2017 10:00,14.67,14.67,14.66,14.66,1200
ABE,17-Oct-2017 11:00,14.65,14.65,14.65,14.65,600
ABE,17-Oct-2017 15:00,14.65,14.65,14.65,14.65,836

What I did was really simple:

import dask.dataframe as dd
import cudf
import dask_cudf

# Failed with NotImplementedError: large_string
df = dd.read_parquet('...')

The only large string I could think of is the timestamp string.

Is there a workaround for this in cudf or dask, given that large_string is not implemented yet? The timestamp format is 2023-03-12 09:00:00+00:00.

Hi @qiuxiao,

I finally managed to install RAPIDS, but was not able to reproduce your problem.

After copying your CSV content into an input.csv file, here is the code I ran:

import dask.dataframe as dd
import cudf
import dask_cudf
import dask

with dask.config.set({"dataframe.backend": "cudf"}):
    df_cu = dask_cudf.read_csv('input.csv')

# Write the data back out as Parquet, then read it again
df_cu.to_parquet('input.parquet')

with dask.config.set({"dataframe.backend": "cudf"}):
    df_parquet = dask_cudf.read_parquet('input.parquet')

Inside my notebook, I was able to read the data and get the expected output, with the date column correctly recognized.

Maybe your Parquet files have some other particularity?