Read_parquet caused "TypeError: '<' not supported between instances of 'NoneType' and 'str'"

I tried to create a Dask DataFrame with read_parquet() and it raised "TypeError: '<' not supported between instances of 'NoneType' and 'str'". The same line worked a couple of days ago, but it does not work now. What is causing the problem?

Below is the error message.

File ~/parts_consumption_sk/lib/python3.9/site-packages/dask/dataframe/io/parquet/core.py:326, in read_parquet(path, columns, filters, categories, index, storage_options, engine, gather_statistics, ignore_metadata_file, metadata_task_size, split_row_groups, chunksize, aggregate_files, **kwargs)
    323         raise ValueError("read_parquet options require gather_statistics=True")
    324     gather_statistics = True
--> 326 read_metadata_result = engine.read_metadata(
    327     fs,
    328     paths,
    329     categories=categories,
    330     index=index,
    331     gather_statistics=gather_statistics,
    332     filters=filters,
    333     split_row_groups=split_row_groups,
    334     chunksize=chunksize,
    335     aggregate_files=aggregate_files,
    336     ignore_metadata_file=ignore_metadata_file,
    337     metadata_task_size=metadata_task_size,
    338     **kwargs,
    339 )
    341 # In the future, we may want to give the engine the
    342 # option to return a dedicated element for `common_kwargs`.
    343 # However, to avoid breaking the API, we just embed this
    344 # data in the first element of `parts` for now.
    345 # The logic below is inteded to handle backward and forward
    346 # compatibility with a user-defined engine.
    347 meta, statistics, parts, index = read_metadata_result[:4]

File ~/parts_consumption_sk/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py:319, in ArrowDatasetEngine.read_metadata(cls, fs, paths, categories, index, gather_statistics, filters, split_row_groups, chunksize, aggregate_files, ignore_metadata_file, metadata_task_size, **kwargs)
    301 @classmethod
    302 def read_metadata(
    303     cls,
   (...)
    317 
    318     # Stage 1: Collect general dataset information
--> 319     dataset_info = cls._collect_dataset_info(
    320         paths,
    321         fs,
    322         categories,
    323         index,
    324         gather_statistics,
    325         filters,
    326         split_row_groups,
    327         chunksize,
    328         aggregate_files,
    329         ignore_metadata_file,
    330         metadata_task_size,
    331         **kwargs.get("dataset", {}),
    332     )
    334     # Stage 2: Generate output `meta`
    335     meta = cls._create_dd_meta(dataset_info)

File ~/parts_consumption_sk/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py:915, in ArrowDatasetEngine._collect_dataset_info(cls, paths, fs, categories, index, gather_statistics, filters, split_row_groups, chunksize, aggregate_files, ignore_metadata_file, metadata_task_size, **dataset_kwargs)
    913 partition_names = list(hive_categories)
    914 for name in partition_names:
--> 915     partition_obj.append(PartitionObj(name, hive_categories[name]))
    917 # Check the `aggregate_files` setting
    918 aggregation_depth = _get_aggregation_depth(aggregate_files, partition_names)

File ~/parts_consumption_sk/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py:152, in PartitionObj.__init__(self, name, keys)
    150 def __init__(self, name, keys):
    151     self.name = name
--> 152     self.keys = sorted(keys)

TypeError: '<' not supported between instances of 'NoneType' and 'str'

Hi @farmd,

This looks like a metadata problem. How are you calling read_parquet? Do you provide metadata for the columns, or do you let Dask infer their types?

Has your input data been modified, or are you reading a different Parquet file than two days ago?

Hi @guillaumeeb,

I let Dask infer the types, and I use this line:

ddf = dd.read_parquet(<parquet_path>, columns=[<needed_column_names>])

Between this time and the last time I read the Parquet file, a new partition was added, called "__HIVE_DEFAULT_PARTITION__". From some research, it seems this partition is created when the partition column has a NULL value. I wonder if this partition is causing the error?
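If it helps, here is a rough sketch of the kind of layout I think is now on disk. The dataset path, partition column ("region"), and values below are placeholders rather than the real ones, and I have not run this against our exact library versions, but I believe it reproduces the same kind of failure:

```python
import os
import pandas as pd
import dask.dataframe as dd

# Hypothetical hive-style layout; a directory named
# region=__HIVE_DEFAULT_PARTITION__ is what Hive/Spark write when the
# partition column contains NULLs.
base = "example_dataset"
os.makedirs(f"{base}/region=east", exist_ok=True)
os.makedirs(f"{base}/region=__HIVE_DEFAULT_PARTITION__", exist_ok=True)

pd.DataFrame({"value": [1, 2]}).to_parquet(f"{base}/region=east/part.0.parquet")
pd.DataFrame({"value": [3]}).to_parquet(
    f"{base}/region=__HIVE_DEFAULT_PARTITION__/part.0.parquet"
)

# With a recent pyarrow, hive-partition discovery treats
# __HIVE_DEFAULT_PARTITION__ as a null key, so Dask ends up sorting a
# list of partition keys that contains None.
ddf = dd.read_parquet(base, columns=["value"])
```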

A new partition added to the Dask DataFrame?

I'd bet it is this new partition, probably created by new data in your Parquet file (do you know what was added to the file?). Data with None values has most likely been added. If possible, I would advise you to give the dtypes of your columns to Dask; that might prevent the error.
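The traceback also shows where it breaks: in PartitionObj.__init__, Dask sorts the hive partition keys, and Python cannot order None against a string. A tiny illustration (the "east" value is just an example):

```python
# The failing frame boils down to a comparison like this one:
sorted(["east", None])
# TypeError: '<' not supported between instances of 'NoneType' and 'str'
```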

@guillaumeeb
Thanks for the advice! We confirmed there were null values in the partition column, which ended up in __HIVE_DEFAULT_PARTITION__, and they have been taken care of. Everything works now.
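In case anyone else runs into this: the null values in the partition column had to be dealt with before Dask could read the dataset. One possible way to avoid writing such a partition in the first place, sketched with placeholder paths, column name, and fill value rather than what we actually used:

```python
import dask.dataframe as dd

# Hypothetical names; replace with your own paths and partition column.
ddf = dd.read_parquet("source_data")

# Make sure the partition column has no nulls before writing, so no
# __HIVE_DEFAULT_PARTITION__ directory gets created.
ddf["region"] = ddf["region"].fillna("unknown")
ddf.to_parquet("clean_dataset", partition_on=["region"])
```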
