I am trying to count the distinct values of each combination of two columns of a Dask DataFrame and store the results in a Dask DataFrame with the following columns:
number of distincts, column name 1, column name 2.
The following code works fine when the results are stored in a Pandas DataFrame:
import itertools
import numpy as np
import pandas as pd

number_of_disincts = ['number_of_disincts']
columns_names = [f'column name{i+1}' for i in range(2)]
distincts = np.array([], dtype=int)
results = pd.DataFrame(columns=number_of_disincts + columns_names)
combinations = list(itertools.combinations(data.columns, 2))
for combination in combinations:
    # Count the rows left after dropping duplicates on this pair of columns.
    distinct_tuples = ddf.drop_duplicates(subset=combination).shape[0].compute()
    distincts = np.append(distincts, [distinct_tuples])
results[number_of_disincts[0]] = distincts
results[columns_names] = combinations
When I try to store the results in a Dask DataFrame instead, using the following code:
import itertools
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd

number_of_disincts = ['number_of_disincts']
columns_names = [f'column name{i+1}' for i in range(2)]
distincts = da.from_array(np.array([], dtype=int))
results = dd.from_pandas(pd.DataFrame(columns=number_of_disincts + columns_names), npartitions=2)
combinations = list(itertools.combinations(data.columns, 2))
for combination in combinations:
    # Same count as before, but accumulated in a Dask array.
    distinct_tuples = ddf.drop_duplicates(subset=combination).shape[0].compute()
    distincts = da.append(distincts, [distinct_tuples])
results[number_of_disincts[0]] = distincts
results[columns_names] = combinations
the line results[number_of_disincts[0]] = distincts raises the following error:
ValueError: Number of partitions do not match
and the line results[columns_names] = combinations raises the following error:
NotImplementedError: Item assignment with <class 'list'> not supported
Is there a workaround that gives the same functionality as Pandas?