I am a new user of `dask_cudf`.
I have parquet files of various sizes (11 GB, 2.5 GB, 1.1 GB), all of which fail with `NotImplementedError: large_string`.
My `dask.dataframe` backend is `cudf`. When the backend is `pandas`, `read_parquet` works fine.
Here's an excerpt of what my data look like in CSV:
```
Symbol,Date,Open,High,Low,Close,Volume
AADR,17-Oct-2017 09:00,57.47,58.3844,57.3645,58.3844,2094
AADR,17-Oct-2017 10:00,57.27,57.2856,57.25,57.27,627
AADR,17-Oct-2017 11:00,56.99,56.99,56.99,56.99,100
AADR,17-Oct-2017 12:00,56.98,57.05,56.98,57.05,200
AADR,17-Oct-2017 13:00,57.14,57.16,57.14,57.16,700
AADR,17-Oct-2017 14:00,57.13,57.13,57.13,57.13,100
AADR,17-Oct-2017 15:00,57.07,57.07,57.07,57.07,200
AAMC,17-Oct-2017 09:00,87,87,87,87,100
AAU,17-Oct-2017 09:00,1.1,1.13,1.0832,1.121,67790
AAU,17-Oct-2017 10:00,1.12,1.12,1.12,1.12,100
AAU,17-Oct-2017 11:00,1.125,1.125,1.125,1.125,200
AAU,17-Oct-2017 12:00,1.1332,1.15,1.1332,1.15,27439
AAU,17-Oct-2017 13:00,1.15,1.15,1.13,1.13,8200
AAU,17-Oct-2017 14:00,1.1467,1.1467,1.14,1.1467,1750
AAU,17-Oct-2017 15:00,1.1401,1.1493,1.1401,1.1493,4100
AAU,17-Oct-2017 16:00,1.13,1.13,1.13,1.13,100
ABE,17-Oct-2017 09:00,14.64,14.64,14.64,14.64,200
ABE,17-Oct-2017 10:00,14.67,14.67,14.66,14.66,1200
ABE,17-Oct-2017 11:00,14.65,14.65,14.65,14.65,600
ABE,17-Oct-2017 15:00,14.65,14.65,14.65,14.65,836
```
What I did was really simple:
```python
import dask.dataframe as dd
import cudf
import dask_cudf

# Failed with large_string error
dask_cudf.read_parquet('path/to/my.parquet')

# Failed with large_string error
dd.read_parquet('path/to/my.parquet')
```
The only large string I could think of is the timestamp string.
Is there a way around this in `dask`, given that `large_string` is not implemented in `cudf` yet? The timestamp format is `DD-Mon-YYYY HH:MM`, as shown in the excerpt above.
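For reference, those `Date` strings parse fine with the stdlib (a quick check, assuming a day / abbreviated-month / year plus 24-hour-time pattern):

```python
from datetime import datetime

# Assumed pattern for the Date column in the excerpt, e.g. "17-Oct-2017 09:00"
FMT = "%d-%b-%Y %H:%M"

ts = datetime.strptime("17-Oct-2017 09:00", FMT)
```

The same format string could be passed to `cudf.to_datetime` once the frame loads, so the column need not stay a string at all.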