Dask_cudf/dask read_parquet failed with NotImplementedError: large_string

qiuxiao · March 30, 2023, 4:29pm

I am a new user of dask/dask_cudf.
I have several Parquet files of various sizes (11GB, 2.5GB, 1.1GB), all of which fail to load with NotImplementedError: large_string. My dask.dataframe backend is cudf. When the backend is pandas, read_parquet works fine.

Here’s an excerpt of what my data looks like in CSV format:

Symbol,Date,Open,High,Low,Close,Volume
AADR,17-Oct-2017 09:00,57.47,58.3844,57.3645,58.3844,2094
AADR,17-Oct-2017 10:00,57.27,57.2856,57.25,57.27,627
AADR,17-Oct-2017 11:00,56.99,56.99,56.99,56.99,100
AADR,17-Oct-2017 12:00,56.98,57.05,56.98,57.05,200
AADR,17-Oct-2017 13:00,57.14,57.16,57.14,57.16,700
AADR,17-Oct-2017 14:00,57.13,57.13,57.13,57.13,100
AADR,17-Oct-2017 15:00,57.07,57.07,57.07,57.07,200
AAMC,17-Oct-2017 09:00,87,87,87,87,100
AAU,17-Oct-2017 09:00,1.1,1.13,1.0832,1.121,67790
AAU,17-Oct-2017 10:00,1.12,1.12,1.12,1.12,100
AAU,17-Oct-2017 11:00,1.125,1.125,1.125,1.125,200
AAU,17-Oct-2017 12:00,1.1332,1.15,1.1332,1.15,27439
AAU,17-Oct-2017 13:00,1.15,1.15,1.13,1.13,8200
AAU,17-Oct-2017 14:00,1.1467,1.1467,1.14,1.1467,1750
AAU,17-Oct-2017 15:00,1.1401,1.1493,1.1401,1.1493,4100
AAU,17-Oct-2017 16:00,1.13,1.13,1.13,1.13,100
ABE,17-Oct-2017 09:00,14.64,14.64,14.64,14.64,200
ABE,17-Oct-2017 10:00,14.67,14.67,14.66,14.66,1200
ABE,17-Oct-2017 11:00,14.65,14.65,14.65,14.65,600
ABE,17-Oct-2017 15:00,14.65,14.65,14.65,14.65,836

What I did was really simple:

import dask.dataframe as dd
import cudf
import dask_cudf

# Failed with large_string error
dask_cudf.read_parquet('path/to/my.parquet')
# Failed with large_string error
dd.read_parquet('path/to/my.parquet')

The only large string I can think of is the timestamp string, whose format is 2023-03-12 09:00:00+00:00.

Is there a way around this in cudf or dask, given that it is not implemented yet?

guillaumeeb · April 4, 2023, 11:12am

Hi @qiuxiao,

I finally managed to install Rapids, but was not able to reproduce your problem.

After copying your CSV content into an input.csv file, here is the code I ran:

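import dask.dataframe as dd
import cudf
import dask_cudf
import dask

# Read the CSV excerpt with the cudf backend and materialize it once
with dask.config.set({"dataframe.backend": "cudf"}):
    df_cu = dask_cudf.read_csv('input.csv')
    df_cu.compute()

# Write the same data back out as Parquet
df_cu.to_parquet('input.parquet')

# Read the Parquet output back and look at the first rows
with dask.config.set({"dataframe.backend": "cudf"}):
    df_parquet = dask_cudf.read_parquet('input.parquet')
    df_parquet.head()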

Inside my notebook, I was able to read the file and get the data output, with the date column correctly recognized.

Maybe your Parquet files have some other particularity?
