Unfortunately, my efforts to create a minimal example failed.
I’m working with some Kaggle data from the American Express competition and trying Dask for the first time.
I’m using a small 50k-row sample to test my operations, and I’m running into a maddening issue. One of the operations fails with the following error:
ValueError: The columns in the computed data do not match the columns in the provided metadata
Order of columns does not match
I have:
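import pickle
import dask.dataframe as dd
# CATEGORICAL_FEATURES (the categorical column names) is defined elsewhere
with open('my_dtypes.pickle', 'rb') as f:
    my_dtypes = pickle.load(f)
my_dd = dd.read_csv('sample_data.csv', dtype=my_dtypes)
my_dd = my_dd.categorize(CATEGORICAL_FEATURES)
my_dd = dd.reshape.get_dummies(my_dd, columns=CATEGORICAL_FEATURES)
print(my_dd.head())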
Using my debugger, I compared the columns of the actual result with those of the meta and saw that the newly created dummy columns are in a different order (D_120_1.0 and D_120_0.0 are swapped):
Index(['Unnamed: 0', 'S_2', 'P_2', 'D_39', 'B_1', 'B_2', 'R_1', 'S_3', 'D_41',
'B_3',
...
'D_117_6.0', 'D_117_2.0', 'D_117_1.0', 'D_117_3.0', 'D_117_5.0',
'D_120_1.0', 'D_120_0.0', 'D_126_1.0', 'D_126_0.0', 'D_126_-1.0'],
dtype='object', length=223)
Index(['Unnamed: 0', 'S_2', 'P_2', 'D_39', 'B_1', 'B_2', 'R_1', 'S_3', 'D_41',
'B_3',
...
'D_117_6.0', 'D_117_2.0', 'D_117_1.0', 'D_117_3.0', 'D_117_5.0',
'D_120_0.0', 'D_120_1.0', 'D_126_1.0', 'D_126_0.0', 'D_126_-1.0'],
dtype='object', length=223)
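With 223 columns, a swap like this is easy to miss by eye. As a minimal sketch, the differing positions can be listed programmatically; the two short lists below are hypothetical stand-ins for the real indexes, and the variable names are illustrative only:

```python
# Hypothetical, shortened stand-ins for the two 223-column indexes above.
first_cols = ["D_117_5.0", "D_120_1.0", "D_120_0.0", "D_126_1.0"]
second_cols = ["D_117_5.0", "D_120_0.0", "D_120_1.0", "D_126_1.0"]

# Collect every position at which the two orderings disagree.
mismatches = [
    (i, a, b)
    for i, (a, b) in enumerate(zip(first_cols, second_cols))
    if a != b
]

# True when both hold exactly the same names and only the order differs.
same_names = sorted(first_cols) == sorted(second_cols)

for position, first_name, second_name in mismatches:
    print(position, first_name, second_name)
print("same names, different order:", same_names and bool(mismatches))
```

With the real data, list(df.columns) for each of the two objects could take the place of the hand-written lists.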
How do I fix this?
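One possible explanation, offered as an assumption rather than a confirmed diagnosis: dummy columns are emitted in the order of the column's categories, so if the categories are discovered in a different order than the metadata expects, the dummy columns come out swapped. The sketch below shows the effect in plain pandas with a hypothetical toy column standing in for D_120; it does not reproduce the Dask error itself.

```python
import pandas as pd

# Hypothetical toy data standing in for one categorical column.
df = pd.DataFrame({"D_120": [1.0, 0.0, 1.0, 0.0]})

# Categories listed in order of appearance in the data.
seen_order = pd.CategoricalDtype(categories=[1.0, 0.0])
# The same categories, listed in an explicit sorted order.
sorted_order = pd.CategoricalDtype(categories=[0.0, 1.0])

# get_dummies emits one column per category, in category order.
cols_seen = list(
    pd.get_dummies(df.astype({"D_120": seen_order}), columns=["D_120"]).columns
)
cols_sorted = list(
    pd.get_dummies(df.astype({"D_120": sorted_order}), columns=["D_120"]).columns
)

print(cols_seen)
print(cols_sorted)
```

If this is indeed the cause, one thing to try in Dask would be casting each column in CATEGORICAL_FEATURES with astype to a CategoricalDtype that has a fixed, explicit category list before calling get_dummies, instead of relying on the order found by categorize. Whether that resolves the error for this data set is untested here.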