Dask DataFrames getting stuck on Google Colab

Hello! I’m not sure whether the problem here is with Dask or Colab (or my own code, of course), and I’m not sure where to go next with troubleshooting. I’m very new to this, so please forgive any errors in terminology.

I have a Dask DataFrame with almost 70,000 columns and just over 20,000 rows. Nearly all of the columns have datatype float; two are strings, and one (the target feature for some multilabel classification I’m planning to do later) is made up of lists.

I’m attempting to make the target column (the column of lists) into a pandas dataframe so I can binarize it with get_dummies(). However, no matter what I try to do with the dask dataframe, Google Colab gets stuck on:
<cell line: 1> > compute() > compute() > get() > get_async() > queue_get() > get() > wait().

I was able to confirm that my dataframe is a dask dataframe, and doing ddf.shape gives me “(Delayed(‘int-75bf401d-4854-4b6d-9be1-34b78a0f180a’), 66198)”. I can view the column names and their types, no problem.

Things I have tried that have caused me to get stuck include:
new_df = ddf[column].compute()
new_df = ddf[column].sample(frac=.001).compute()
new_df = ddf.sample(frac=0.001).compute()
new_df = ddf[column].head()
print(ddf.sample.head())

along with every other combination I could think of: printing, assigning, .head(), .sample(), .compute(), and selecting just one of the columns. I even tried using one of the float columns instead of the list column, just to see if that was the problem. No dice. Hypothetically, the sample with frac=.001 should take it down to a dataframe with fewer than 30 rows.

I switched to a paid version of Google Colab just to see if it would help. It made everything else faster, but this cell still hangs.

Any suggestions on how to further troubleshoot, how to fix this, or where else I could ask for help would be greatly appreciated!

Hi @EddisFargo,

Does this problem show up only with this dataset? Are you usually able to use Dask DataFrame on Colab?

Are you able to read part of the dataset using Pandas?

How do you read the input data?

Hello @guillaumeeb ! Thanks so much for your reply!

This is my first time ever using Dask or Colab, so I’m afraid I don’t have much of a basis for comparison.

I was indeed able to read it using pandas, and ended up doing what I set out to do without using Dask at all. I’d still like to figure out what went wrong so Dask will be an option for me in the future, if possible.
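The exact pandas code used isn't shown in this thread, so the following is only a sketch of one way a column of lists might be binarized with get_dummies(). The column name labels and the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical target column where each row holds a list of labels.
df = pd.DataFrame({"labels": [["a", "b"], ["b"], ["a", "c"]]})

# explode() gives one row per label and repeats the original index,
# so the dummies can be collapsed back to one row per original record.
exploded = df["labels"].explode()
dummies = pd.get_dummies(exploded).groupby(level=0).max().astype(int)
print(dummies)
```

Each original row ends up with a 1 in every column whose label appeared in its list.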

The input data was coming from a series of 21 .csv files that I read in as individual dataframes and then concatenated.

I wasn’t sure what else would be relevant to the problem I was having. When I searched, I couldn’t find anyone else in this exact situation.

Thanks again!

Did you try using Dask in Colab with some Dask examples?