Dask DataFrames getting stuck on Google Colab

EddisFargo · August 25, 2023, 6:04pm

Hello! I’m not sure whether the problem here is with Dask or Colab (or my own code, of course), and I’m not sure where to go next with troubleshooting. I’m very new to this, so please forgive any errors in terminology.

I have a Dask DataFrame with almost 70,000 columns and just over 20,000 rows. Nearly all of the columns have datatype float; two are strings, and one (the target feature for some multilabel classification I’m planning to do later) is made up of lists.

I’m attempting to make the target column (the column of lists) into a pandas dataframe so I can binarize it with get_dummies(). However, no matter what I try to do with the dask dataframe, Google Colab gets stuck on:
<cell line: 1> > compute() > compute() > get() > get_async() > queue_get() > get() > wait().
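
For reference, the binarization step itself can be sketched in plain pandas. This is only an illustration on made-up labels (the series `labels` and its values are hypothetical, not my real data), and it assumes each cell holds a Python list:

```python
import pandas as pd

# Hypothetical target column: each row holds a list of labels.
labels = pd.Series([["a", "b"], ["b"], ["a", "c"]])

# explode() gives one row per (original row, label) pair and keeps the
# original index; get_dummies() then one-hot encodes each label.
one_hot = pd.get_dummies(labels.explode())

# Collapse back to one row per original row: a label column is set
# if any of that row's labels matched it.
dummies = one_hot.groupby(level=0).max()

print(dummies.astype(int))
```

Each original row ends up as a single row of indicator columns, one column per distinct label.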

I was able to confirm that my dataframe is a dask dataframe, and doing ddf.shape gives me “(Delayed(‘int-75bf401d-4854-4b6d-9be1-34b78a0f180a’), 66198)”. I can view the column names and their types, no problem.

Things I have tried that have caused me to get stuck include:
new_df = ddf[column].compute()
new_df = ddf[column].sample(frac=.001).compute()
new_df = ddf.sample(frac=0.001).compute()
new_df = ddf[column].head()
print(ddf.sample.head())

and any combination of printing, assigning, .head(), sample(), .compute(), and using just one of the columns that I could think of. I even tried using one of the float columns instead of the list column, just to see if that was the problem. No dice. Hypothetically, the sample with frac=.001 should take it down to a dataframe with fewer than 30 rows.

I switched to a paid version of Google Colab just to see if it would help. It made everything else faster, but this cell still hangs.

Any suggestions on how to further troubleshoot, how to fix this, or where else I could ask for help would be greatly appreciated!

guillaumeeb · August 25, 2023, 8:00pm

Hi @EddisFargo,

Does this problem show up only with this dataset? Are you usually able to use Dask DataFrame on Colab?

Are you able to read part of the dataset using Pandas?

How do you read the input data?
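
To check whether part of the data is readable with Pandas, something like the sketch below could work. The file and column names are placeholders (a small stand-in CSV is written first so the snippet runs on its own); substitute one of your real files:

```python
import os
import tempfile

import pandas as pd

# Stand-in CSV with hypothetical columns, so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
pd.DataFrame(
    {"f0": [0.1, 0.2, 0.3], "f1": [1.0, 2.0, 3.0], "name": ["a", "b", "c"]}
).to_csv(path, index=False)

# nrows=0 parses only the header, which is cheap even for very wide files.
columns = pd.read_csv(path, nrows=0).columns.tolist()

# Read just the first two rows of two selected columns.
preview = pd.read_csv(path, usecols=["f0", "name"], nrows=2)

print(columns)
print(preview)
```

If this is fast on a real file, the files themselves are probably fine and the issue is more likely in how the Dask graph is built or executed.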

EddisFargo · August 25, 2023, 10:04pm

Hello @guillaumeeb ! Thanks so much for your reply!

This is my first time ever using Dask or Colab, so I’m afraid I don’t have much of a basis for comparison.

I was indeed able to read it using pandas, and ended up doing what I set out to do without using Dask at all. I’d still like to figure out what went wrong so Dask will be an option for me in the future, if possible.

The input data came from a series of 21 .csv files that I read in as individual dataframes and then concatenated.

Wasn’t sure what else would be relevant to the problem I was having–when I searched, I couldn’t find anyone else in this exact situation.

Thanks again!

guillaumeeb · August 27, 2023, 2:14pm

Did you try using Dask in Colab with some Dask examples?
