Hello! I’m not sure whether the problem here is with Dask or Colab (or my own code, of course), and I’m not sure where to go next with troubleshooting. I’m very new to this, so please forgive any errors in terminology.
I have a Dask DataFrame with almost 70,000 columns and just over 20,000 rows. Nearly all of the columns have a float datatype; two are strings, and one (the target feature for some multilabel classification I’m planning to do later) is made up of lists.
I’m attempting to make the target column (the column of lists) into a pandas dataframe so I can binarize it with get_dummies(). However, no matter what I try to do with the dask dataframe, Google Colab gets stuck on:
<cell line: 1> > compute() > compute() > get() > get_async() > queue_get() > get() > wait().
I was able to confirm that my dataframe is a dask dataframe, and doing ddf.shape gives me “(Delayed(‘int-75bf401d-4854-4b6d-9be1-34b78a0f180a’), 66198)”. I can view the column names and their types, no problem.
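For context, this is roughly what I want to end up with once the target column is in pandas. It’s only a sketch on a made-up toy Series: the labels and the name `target` are placeholders rather than my real data, and calling explode() before get_dummies() is my assumption about how to handle a column of lists.

```python
import pandas as pd

# Toy stand-in for the target column: each row holds a list of labels.
# The labels and the name "target" are made up for illustration.
target = pd.Series([["a", "b"], ["b"], ["a", "c"]], name="target")

# explode() gives one row per (original row, label) pair, repeating the index.
exploded = target.explode()

# One-hot encode each label, then collapse back to one row per original index.
binarized = pd.get_dummies(exploded).groupby(level=0).max().astype(int)

print(binarized)
```

Each row of `binarized` should then have a 1 in the column of every label that appeared in that row’s list.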
Things I have tried that have caused me to get stuck include:
new_df = ddf[column].compute()
new_df = ddf[column].sample(frac=.001).compute()
new_df = ddf.sample(frac=0.001).compute()
new_df = ddf[column].head()
print(ddf.sample.head())
and any combination of printing, assigning, .head(), .sample(), and .compute() that I could think of, including using just one of the columns. I even tried using one of the float columns instead of the list column, just to see if the list column was the problem. No dice. Hypothetically, the sample with frac=.001 should take it down to a dataframe with fewer than 30 rows.
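To sanity-check that row count, here is the same sampling on a plain pandas frame. This is a sketch with assumed round numbers: 20,000 stands in for my "just over 20,000" rows, and the single column `x` is an arbitrary placeholder. As far as I understand, Dask samples within each partition, so its count may only be approximately this.

```python
import pandas as pd

# Assumed round number: the real frame has "just over 20,000" rows.
n_rows = 20_000
frac = 0.001

# Tiny single-column stand-in; the values are arbitrary.
pdf = pd.DataFrame({"x": range(n_rows)})

# pandas draws round(frac * n_rows) rows when frac is given.
sampled = pdf.sample(frac=frac, random_state=0)

print(len(sampled))
```

So even with a few hundred extra rows, the sample should be tiny, which is why I didn’t expect the cell to hang.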
I switched to a paid version of Google Colab just to see if it would help. It made everything else faster, but this cell still hangs.
Any suggestions on how to further troubleshoot, how to fix this, or where else I could ask for help would be greatly appreciated!