Actual and meta columns mismatch

DRudel · July 20, 2022, 8:09am

Unfortunately, my efforts to create a minimal example failed.
I’m working with some Kaggle data for the American Express competition and trying to use Dask for the first time.

I’m using a small 50k-row sample of the data to develop my operations, and I’m running into a maddening issue. One of the operations fails with the following error:

ValueError: The columns in the computed data do not match the columns in the provided metadata
Order of columns does not match

I have:

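import pickle
import dask.dataframe as dd

# CATEGORICAL_FEATURES is a list of categorical column names, defined elsewhere
with open('my_dtypes.pickle', 'rb') as f:
    my_dtypes = pickle.load(f)
my_dd = dd.read_csv('sample_data.csv', dtype=my_dtypes)
my_dd = my_dd.categorize(CATEGORICAL_FEATURES)
my_dd = dd.reshape.get_dummies(my_dd, columns=CATEGORICAL_FEATURES)
print(my_dd.head())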

Using my debugger to compare the columns of actual and meta, I see that the newly created dummy columns are in a different order:

Index(['Unnamed: 0', 'S_2', 'P_2', 'D_39', 'B_1', 'B_2', 'R_1', 'S_3', 'D_41',
'B_3',
...
'D_117_6.0', 'D_117_2.0', 'D_117_1.0', 'D_117_3.0', 'D_117_5.0',
'D_120_1.0', 'D_120_0.0', 'D_126_1.0', 'D_126_0.0', 'D_126_-1.0'],
dtype='object', length=223)

Index(['Unnamed: 0', 'S_2', 'P_2', 'D_39', 'B_1', 'B_2', 'R_1', 'S_3', 'D_41',
'B_3',
...
'D_117_6.0', 'D_117_2.0', 'D_117_1.0', 'D_117_3.0', 'D_117_5.0',
'D_120_0.0', 'D_120_1.0', 'D_126_1.0', 'D_126_0.0', 'D_126_-1.0'],
dtype='object', length=223)

How do I fix this?
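For anyone hitting the same error, a quick way to see exactly which positions disagree is to diff the two column lists rather than reading them by eye. The helper below is only an illustrative sketch: `column_order_diff` is a hypothetical name, not part of Dask or pandas, and the two lists are just the tails of the indexes quoted above.

```python
def column_order_diff(actual, meta):
    """Return (position, actual_name, meta_name) wherever the two orders differ."""
    actual, meta = list(actual), list(meta)
    # This helper only diagnoses ordering; different column sets are another problem.
    if len(actual) != len(meta) or set(actual) != set(meta):
        raise ValueError("column sets differ; this helper only compares ordering")
    return [(i, a, m) for i, (a, m) in enumerate(zip(actual, meta)) if a != m]


# Tails of the two indexes shown in the post above
first = ['D_117_5.0', 'D_120_1.0', 'D_120_0.0', 'D_126_1.0']
second = ['D_117_5.0', 'D_120_0.0', 'D_120_1.0', 'D_126_1.0']

mismatches = column_order_diff(first, second)
print(mismatches)
# [(1, 'D_120_1.0', 'D_120_0.0'), (2, 'D_120_0.0', 'D_120_1.0')]
```

In a debugger session the same function could be called on the `.columns` of the two frames being compared, since any iterable of names works.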

DRudel · July 21, 2022, 3:30am

I’ve opened a ticket with a reproducible example.

github.com/dask/dask

Mismatched columns created after get_dummies() [ValueError: Order of columns does not match]

opened 03:30AM - 21 Jul 22 UTC · closed 01:24PM - 21 Jul 22 UTC · DRudel · needs triage

This may be related to [6865](https://github.com/dask/dask/issues/6856)

**What …happened**: I called `get_dummies()` and then called head() to check results. I got a ValueError: Mismatched columns

**What you expected to happen**: Top 5 rows of new version are printed

**Minimal Complete Verifiable Example**:

```python
my_dd = dd.read_csv('repro.csv')
my_dd = my_dd.categorize(my_dd.columns)
my_dd = dd.reshape.get_dummies(my_dd)
print(my_dd.head())
```

**Anything else we need to know?**: The order of the E columns is switched between `actual` and `meta`. When using pycharm debugger to inspect columns, prior to calling `head()`, the columns are in the correct order, but somehow the two E columns (E_1.0 and E_0.0) get switched in `actual` at some point without being switched in `meta`.

**Environment**:

- Dask version: 2022_7.0
- Python version: 3.9
- Operating System: Windows (WinPython)
- Install method (conda, pip, source): pip

[repro.csv](https://github.com/dask/dask/files/9155952/repro.csv)
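As background on how two dummy-column orders can diverge at all: in plain pandas, `get_dummies` names the dummy columns in the order of the categorical's categories, not in sorted order. The sketch below shows only that pandas behavior on made-up data. The thread does not say what caused the Dask bug, so this should not be read as its confirmed cause.

```python
import pandas as pd

# Made-up values; the same data with two different category orders.
values = [1.0, 0.0, 1.0]
as_seen = pd.DataFrame({"E": pd.Categorical(values, categories=[1.0, 0.0])})
as_sorted = pd.DataFrame({"E": pd.Categorical(values, categories=[0.0, 1.0])})

# Dummy columns follow the category order of each frame.
cols_seen = list(pd.get_dummies(as_seen, columns=["E"]).columns)
cols_sorted = list(pd.get_dummies(as_sorted, columns=["E"]).columns)

print(cols_seen)    # ['E_1.0', 'E_0.0']
print(cols_sorted)  # ['E_0.0', 'E_1.0']
```

If two frames hold the same data but carry their categories in different orders, their dummy columns will be ordered differently too, which is the same shape of mismatch the error message reports.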

pavithraes · July 22, 2022, 4:17pm

As I mentioned on GitHub, this was a bug and it has been fixed.
