To_parquet - loading Dask DataFrame to GCS bucket not successful

I am trying to write a Dask DataFrame (I can preview it with .head()) to a GCS bucket using a cluster. I have done this successfully before, but lately it no longer works.

When I interrupt the kernel to adjust my code, I get the following error: CancelledError: Booster-0d7a05b4abdcb0f90f82d4e2d31f85fd. I then have to stop the entire cluster and restart it before I can run the code again.

I am unsure what to do and have run out of ideas for fixes; I have been looking at this error for a while now. Updating dask[complete] doesn’t seem to help either.

Any tips appreciated.

Here is my package list:

Package                            Version     Location             
---------------------------------- ----------- ---------------------
aiohttp                            3.8.1       
aiosignal                          1.2.0       
alabaster                          0.7.12      
alembic                            1.7.5       
anaconda-client                    1.7.2       
anaconda-navigator                 1.9.7       
anaconda-project                   0.8.3       
ansiwrap                           0.8.4       
appdirs                            1.4.4       
arrow                              1.2.1       
asn1crypto                         1.0.1       
astroid                            2.3.1       
astropy                            3.2.2       
async-generator                    1.10        
async-timeout                      4.0.2       
asyncio                            3.4.3       
asynctest                          0.13.0      
atomicwrites                       1.3.0       
attrs                              21.4.0      
Babel                              2.7.0       
backcall                           0.1.0       
backports.functools-lru-cache      1.5         
backports.os                       0.1.1       
backports.shutil-get-terminal-size 1.0.0       
backports.tempfile                 1.0         
backports.weakref                  1.0.post1   
bcolz                              1.2.1       
beautifulsoup4                     4.8.0       
binaryornot                        0.4.4       
bitarray                           1.0.1       
bkcharts                           0.2         
black                              21.12b0     
bleach                             3.1.0       
blinker                            1.4         
bokeh                              2.4.3       
boto                               2.49.0      
Bottleneck                         1.2.1       
bs4                                0.0.1       
cachetools                         4.2.4       
certifi                            2019.9.11   
certipy                            0.1.3       
cffi                               1.12.3      
chardet                            3.0.4       
charset-normalizer                 2.0.10      
Click                              7.0         
cloudpickle                        2.2.1       
clyent                             1.2.2       
cmake                              3.22.4      
colorama                           0.4.1       
conda                              4.7.12      
conda-build                        3.18.9      
conda-package-handling             1.6.0       
conda-verify                       3.4.2       
confuse                            1.4.0       
contextlib2                        0.6.0       
cookiecutter                       1.7.3       
cryptography                       2.7         
cycler                             0.10.0      
Cython                             0.29.13     
cytoolz                            0.10.0      
dask                               2022.2.0    
dask-bigquery                      2021.10.1   
db-dtypes                          1.0.1       
decorator                          4.4.0       
defusedxml                         0.6.0       
distlib                            0.3.4       
distributed                        2022.2.0    
docker                             5.0.3       
docopt                             0.6.2       
docutils                           0.15.2      
entrypoints                        0.3         
et-xmlfile                         1.0.1       
fairing                            0.5.3       
fastcache                          1.1.0       
filelock                           3.0.12      
findspark                          1.3.0       
Flask                              1.1.1       
flit-core                          3.6.0       
frozenlist                         1.2.0       
fsspec                             2021.11.1   
future                             0.17.1      
gcsfs                              2021.11.1   
gevent                             1.4.0       
gitdb                              4.0.9       
GitPython                          3.1.25      
glob2                              0.7         
gmpy2                              2.0.8       
google-api-core                    1.31.5      
google-api-python-client           1.11.0      
google-auth                        1.35.0      
google-auth-httplib2               0.1.0       
google-auth-oauthlib               0.4.6       
google-cloud                       0.34.0      
google-cloud-bigquery              3.1.0       
google-cloud-bigquery-storage      2.13.1      
google-cloud-core                  1.4.4       
google-cloud-dataproc              1.1.1       
google-cloud-datastore             1.15.3      
google-cloud-language              1.3.0       
google-cloud-logging               1.15.1      
google-cloud-spanner               1.18.0      
google-cloud-storage               1.31.2      
google-cloud-translate             2.0.2       
google-crc32c                      1.3.0       
google-resumable-media             1.3.3       
googleapis-common-protos           1.54.0      
greenlet                           0.4.15      
grpc-google-iam-v1                 0.12.3      
grpcio                             1.46.3      
h5py                               2.9.0       
HeapDict                           1.0.1       
html5lib                           1.0.1       
htmlmin                            0.1.12      
httplib2                           0.18.1      
idna                               2.8         
ImageHash                          4.2.1       
imageio                            2.6.0       
imagesize                          1.1.0       
importlib-metadata                 0.23        
importlib-resources                5.1.4       
ipykernel                          5.1.2       
ipyparallel                        6.3.0       
ipython                            7.8.0       
ipython-genutils                   0.2.0       
ipython-sql                        0.4.0       
ipywidgets                         7.5.1       
isort                              4.3.21      
itsdangerous                       1.1.0       
jdcal                              1.4.1       
jedi                               0.15.1      
jeepney                            0.4.1       
Jinja2                             2.10.3      
jinja2-time                        0.2.0       
joblib                             0.13.2      
json5                              0.8.5       
jsonschema                         3.0.2       
jupyter                            1.0.0       
jupyter-client                     5.3.5       
jupyter-console                    6.0.0       
jupyter-contrib-core               0.3.3       
jupyter-contrib-nbextensions       0.5.1       
jupyter-core                       4.6.3       
jupyter-gcs-contents-manager       0.0.1       
jupyter-highlight-selected-word    0.2.0       
jupyter-http-over-ws               0.0.8       
jupyter-latex-envs                 1.4.6       
jupyter-nbextensions-configurator  0.4.1       
jupyterhub                         1.0.0       
jupyterlab                         1.1.4       
jupyterlab-git                     0.11.0      
jupyterlab-server                  1.0.6       
keyring                            18.0.0      
kiwisolver                         1.1.0       
kubernetes                         21.7.0      
lazy-object-proxy                  1.4.2       
libarchive-c                       2.8         
lief                               0.9.0       
llvmlite                           0.29.0      
locket                             0.2.0       
lxml                               4.4.1       
Mako                               1.1.4       
Markdown                           3.2.2       
MarkupSafe                         1.1.1       
matplotlib                         3.1.1       
mccabe                             0.6.1       
metakernel                         0.27.5      
missingno                          0.4.3       
mistune                            0.8.4       
mkl-fft                            1.0.14      
mkl-random                         1.1.0       
mkl-service                        2.3.0       
mock                               3.0.5       
more-itertools                     7.2.0       
mpmath                             1.1.0       
msgpack                            0.6.1       
multidict                          5.2.0       
multipledispatch                   0.6.0       
mypy-extensions                    0.4.3       
navigator-updater                  0.2.1       
nbclient                           0.5.9       
nbconvert                          5.6.0       
nbdime                             1.0.7       
nbformat                           4.4.0       
nest-asyncio                       1.5.4       
networkx                           2.6.3       
nltk                               3.4.5       
nose                               1.3.7       
notebook                           6.0.3       
numba                              0.45.1      
numexpr                            2.7.0       
numpy                              1.21.6      
numpydoc                           0.9.1       
oauth2client                       4.1.3       
oauthlib                           3.1.0       
olefile                            0.46        
opencv-python                      4.2.0.34    
openpyxl                           3.0.0       
packaging                          21.3        
pamela                             1.0.0       
pandas                             1.2.5       
pandas-gbq                         0.17.5      
pandas-profiling                   2.9.0       
pandocfilters                      1.4.2       
papermill                          2.1.3       
parso                              0.5.1       
partd                              1.0.0       
path.py                            12.0.1      
pathlib2                           2.3.5       
pathspec                           0.9.0       
patsy                              0.5.1       
pep8                               1.7.1       
pexpect                            4.7.0       
phik                               0.12.0      
pickleshare                        0.7.5       
Pillow                             8.3.2       
Pillow-SIMD                        7.0.0.post4 
pip                                19.2.3      
pipreqs                            0.4.11      
pkginfo                            1.5.0.1     
platformdirs                       2.4.1       
pluggy                             0.13.0      
ply                                3.11        
portalocker                        2.3.0       
poyo                               0.5.0       
prettytable                        0.7.2       
prometheus-client                  0.7.1       
prompt-toolkit                     2.0.10      
proto-plus                         1.20.5      
protobuf                           3.13.0      
psutil                             5.6.3       
ptyprocess                         0.6.0       
py                                 1.8.0       
py4j                               0.10.7      
pyarrow                            8.0.0       
pyasn1                             0.4.8       
pyasn1-modules                     0.2.8       
pycodestyle                        2.5.0       
pycosat                            0.6.3       
pycparser                          2.19        
pycrypto                           2.6.1       
pycurl                             7.43.0.3    
pydata-google-auth                 1.4.0       
pydot                              1.4.2       
pyflakes                           2.1.1       
Pygments                           2.4.2       
PyJWT                              2.0.1       
pylint                             2.4.2       
pyodbc                             4.0.27      
pyOpenSSL                          19.0.0      
pyparsing                          2.4.2       
PyQt5                              5.12.3      
PyQt5-sip                          12.11.1     
PyQtWebEngine                      5.12.1      
pyrsistent                         0.15.4      
PySocks                            1.7.1       
pyspark                            2.4.8       /usr/lib/spark/python
pytest                             5.2.1       
pytest-arraydiff                   0.3         
pytest-astropy                     0.5.0       
pytest-doctestplus                 0.4.0       
pytest-openfiles                   0.4.0       
pytest-remotedata                  0.3.2       
python-dateutil                    2.8.0       
python-slugify                     5.0.2       
pytz                               2019.3      
PyWavelets                         1.0.3       
PyYAML                             6.0         
pyzmq                              18.1.0      
QtAwesome                          0.6.0       
qtconsole                          4.5.5       
QtPy                               1.9.0       
requests                           2.27.1      
requests-oauthlib                  1.3.0       
rope                               0.14.0      
rsa                                4.6         
ruamel-yaml                        0.15.46     
scikit-image                       0.15.0      
scikit-learn                       0.21.3      
scipy                              1.3.1       
seaborn                            0.9.0       
SecretStorage                      3.1.1       
Send2Trash                         1.5.0       
setuptools                         41.4.0      
shap                               0.40.0      
simplegeneric                      0.8.1       
singledispatch                     3.4.0.3     
six                                1.15.0      
sklearn                            0.0.post1   
slicer                             0.0.7       
smmap                              5.0.0       
snowballstemmer                    2.0.0       
sortedcollections                  1.1.2       
sortedcontainers                   2.1.0       
soupsieve                          1.9.3       
sparkmonitor-s                     0.0.22      
Sphinx                             2.2.0       
sphinxcontrib-applehelp            1.0.1       
sphinxcontrib-devhelp              1.0.1       
sphinxcontrib-htmlhelp             1.0.2       
sphinxcontrib-jsmath               1.0.1       
sphinxcontrib-qthelp               1.0.2       
sphinxcontrib-serializinghtml      1.1.3       
sphinxcontrib-websupport           1.1.2       
spyder                             3.3.6       
spyder-kernels                     0.5.2       
spylon                             0.3.0       
spylon-kernel                      0.4.1       
SQLAlchemy                         1.3.9       
sqlparse                           0.4.2       
statsmodels                        0.10.1      
sympy                              1.4         
tables                             3.5.2       
tangled-up-in-unicode              0.2.0       
tblib                              1.7.0       
tenacity                           7.0.0       
terminado                          0.8.2       
testpath                           0.4.2       
text-unidecode                     1.3         
textwrap3                          0.9.2       
tomli                              1.2.3       
toolz                              0.10.0      
tornado                            6.1         
tqdm                               4.62.3      
traitlets                          4.3.3       
typed-ast                          1.5.1       
typing-extensions                  4.0.1       
unicodecsv                         0.14.1      
uritemplate                        3.0.1       
urllib3                            1.24.2      
virtualenv                         20.0.35     
visions                            0.5.0       
wcwidth                            0.1.7       
webencodings                       0.5.1       
websocket-client                   1.2.3       
Werkzeug                           0.16.0      
wheel                              0.33.6      
widgetsnbextension                 3.5.1       
wrapt                              1.11.2      
wurlitzer                          1.0.3       
xgboost                            1.6.2       
xlrd                               1.2.0       
XlsxWriter                         1.2.1       
xlwt                               1.3.0       
yarg                               0.1.9       
yarl                               1.7.2       
zict                               1.0.0       
zipp                               0.6.0

Hi @boot329, welcome here!

Could you give more detail on the code you’re using to load the data, and on the error you get (or does the execution just hang)? Do you get errors only when you interrupt the kernel? Did you use the Dashboard to get some insight into what the problem could be?

Hi @guillaumeeb

Of course. Brief overview: I am running Jupyter within Google’s GCP Dataproc environment (4 worker machines). I import a Dask DataFrame from Google BigQuery (using the dask-bigquery package) for scoring with an XGBoost model; the resulting output is a Dask DataFrame with 3 columns (prediction of 0, prediction of 1, and ID). I normally write this to a GCS bucket in Parquet format (I have done this many times in the past without failure).

I would love to use the dashboard, but since I am running Jupyter on this Dataproc machine, clicking the dashboard link shown on the client object results in a 404 error page (most likely because I am on a virtual machine?). Is there a way to do this?

Here is the code below:

import pandas as pd
import numpy as np
import dask_bigquery
import xgboost as xgb
import dask.array as da
import dask.distributed
import dask.dataframe as dd 
from google.cloud import bigquery as bq

# Convert Bigquery Table to Dask Dataframe
ddf = dask_bigquery.read_gbq(
    project_id="dummy_projectid",
    dataset_id="dummy_datasetid",
    table_id="dummy_tableid"
)

cluster = dask.distributed.LocalCluster()
client = dask.distributed.Client(cluster)

clf = xgb.Booster({'nthread': 4})  # init model
clf.load_model('dummy_model.json')  # load pre trained model

# X and y must be Dask dataframes or arrays
X = ddf.loc[:,ddf.columns.isin(clf.feature_names)]   
y = ddf.loc[:,'Y_target']

clf = xgb.dask.DaskXGBClassifier(n_estimators=500, tree_method="hist", objective = 'binary:logistic', random_state =7, n_jobs = 40, eval_metric = 'auc')
clf.client = client  # assign the client
clf.fit(X, y)


## Get New Data for scoring

# Convert Bigquery Table to Dask Dataframe
scoring_df = dask_bigquery.read_gbq(
    project_id="dummy_projectid",
    dataset_id="dummy_datasetid",
    table_id="dummy_tableid"
)
X_predict = scoring_df.loc[:, scoring_df.columns.isin(clf.feature_names_in_)]  # get X variables
predictions = clf.predict_proba(X_predict)
predictions_df = dd.from_array(predictions)  # convert to Dask DataFrame

predictions_df = predictions_df.repartition(npartitions=1)
predictions_df['hh_id'] = scoring_df.thd_hh_id.repartition(npartitions=1)

# export predictions to GCS bucket prior to importing to GBQ table 
predictions_df.to_parquet(dummy_gcs_location)

Also a quick callout: even when I try to write the Dask DataFrame to a CSV file locally (predictions_df.to_csv('path/dummy.csv')), it creates a folder at that location with the CSV name, but it just keeps spinning and no data is saved.

I took the liberty to edit your post to insert code cells for better readability.

You should be able to reach the Dashboard through the Jupyter proxy, at an address like the following: https://your_jupyter_url/user/your_user_name/proxy/8787/status (assuming the Scheduler runs on port 8787).
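
As a small sketch of that address pattern, the helper below builds the proxy URL from its parts. The function name `proxy_dashboard_url` and its parameters are made up for illustration, and the `/proxy/<port>/` route is assumed to be served by your Jupyter deployment (typically via the jupyter-server-proxy extension), so treat this as a starting point rather than something guaranteed to work on Dataproc.

```python
from urllib.parse import quote


def proxy_dashboard_url(jupyter_url: str, user_name: str, port: int = 8787) -> str:
    """Build <jupyter_url>/user/<user_name>/proxy/<port>/status.

    Hypothetical helper: it only assembles the string, it does not check
    that the proxy route or the scheduler port actually exist.
    """
    base = jupyter_url.rstrip("/")  # avoid a double slash after the host
    return f"{base}/user/{quote(user_name, safe='')}/proxy/{port}/status"


# Placeholder values taken from the address pattern above.
print(proxy_dashboard_url("https://your_jupyter_url", "your_user_name"))
```

The port to use is the one at the end of the link shown by `client.dashboard_link`; 8787 is only the default and changes if that port was already taken when the cluster started.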

For the writing problem, it’s still hard to tell. Are you able to write a dummy DataFrame to your GCS bucket? Since writing a CSV doesn’t work either, I guess the problem is in the processing phase. Do you get errors, or does the code just hang? Could you try a lighter XGBoost configuration?
Also, since you repartition the results into a single partition at the end, I guess they fit in memory. Could you try just executing predictions.compute()?

Ah apologies - I will format any code like that going forward.

So I successfully imported a Dask DataFrame (100 records) and immediately wrote it to the GCS bucket with the .to_parquet() method. I have also run the entire process up to the Parquet export and viewed the top rows with predictions_df.head(). However, the .compute() method is super slow and I have not had success with it.

I recently ran the process and it finally returned a long list of errors… does anything jump out?
Again thank you for any tips!

tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <zmq.eventloop.ioloop.ZMQIOLoop object at 0x7f1f299fadd0>>, <Task finished coro=<Cluster._sync_cluster_info() done, defined at /opt/conda/anaconda/lib/python3.7/site-packages/distributed/deploy/cluster.py:104> exception=OSError('Timed out trying to connect to tcp://127.0.0.1:33441 after 30 s')>)
Traceback (most recent call last):
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/comm/core.py", line 286, in connect
    timeout=min(intermediate_cap, time_left()),
  File "/opt/conda/anaconda/lib/python3.7/asyncio/tasks.py", line 449, in wait_for
    raise futures.TimeoutError()
concurrent.futures._base.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/anaconda/lib/python3.7/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/opt/conda/anaconda/lib/python3.7/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/deploy/cluster.py", line 107, in _sync_cluster_info
    value=copy.copy(self._cluster_info),
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 785, in send_recv_from_rpc
    comm = await self.live_comm()
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 746, in live_comm
    **self.connection_args,
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/comm/core.py", line 310, in connect
    ) from active_exception
OSError: Timed out trying to connect to tcp://127.0.0.1:33441 after 30 s
ERROR:asyncio:Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /opt/conda/anaconda/lib/python3.7/asyncio/tasks.py:623> exception=OSError('Timed out trying to connect to tcp://127.0.0.1:33441 after 30 s')>
Traceback (most recent call last):
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/comm/core.py", line 286, in connect
    timeout=min(intermediate_cap, time_left()),
  File "/opt/conda/anaconda/lib/python3.7/asyncio/tasks.py", line 449, in wait_for
    raise futures.TimeoutError()
concurrent.futures._base.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/anaconda/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
    await self.start()
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/nanny.py", line 333, in start
    msg = await self.scheduler.register_nanny()
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 860, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 1048, in connect
    raise exc
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 1032, in connect
    comm = await fut
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/comm/core.py", line 310, in connect
    ) from active_exception
OSError: Timed out trying to connect to tcp://127.0.0.1:33441 after 30 s
ERROR:asyncio:Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /opt/conda/anaconda/lib/python3.7/asyncio/tasks.py:623> exception=OSError('Timed out trying to connect to tcp://127.0.0.1:33441 after 30 s')>
Traceback (most recent call last):
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/comm/core.py", line 286, in connect
    timeout=min(intermediate_cap, time_left()),
  File "/opt/conda/anaconda/lib/python3.7/asyncio/tasks.py", line 449, in wait_for
    raise futures.TimeoutError()
concurrent.futures._base.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/anaconda/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
    await self.start()
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/nanny.py", line 333, in start
    msg = await self.scheduler.register_nanny()
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 860, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 1048, in connect
    raise exc
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 1032, in connect
    comm = await fut
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/comm/core.py", line 310, in connect
    ) from active_exception
OSError: Timed out trying to connect to tcp://127.0.0.1:33441 after 30 s
ERROR:asyncio:Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /opt/conda/anaconda/lib/python3.7/asyncio/tasks.py:623> exception=OSError('Timed out trying to connect to tcp://127.0.0.1:33441 after 30 s')>
Traceback (most recent call last):
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/comm/core.py", line 286, in connect
    timeout=min(intermediate_cap, time_left()),
  File "/opt/conda/anaconda/lib/python3.7/asyncio/tasks.py", line 449, in wait_for
    raise futures.TimeoutError()
concurrent.futures._base.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/anaconda/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
    await self.start()
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/nanny.py", line 333, in start
    msg = await self.scheduler.register_nanny()
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 860, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 1048, in connect
    raise exc
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 1032, in connect
    comm = await fut
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/comm/core.py", line 310, in connect
    ) from active_exception
OSError: Timed out trying to connect to tcp://127.0.0.1:33441 after 30 s
ERROR:asyncio:Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /opt/conda/anaconda/lib/python3.7/asyncio/tasks.py:623> exception=OSError('Timed out trying to connect to tcp://127.0.0.1:33441 after 30 s')>
Traceback (most recent call last):
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/comm/core.py", line 286, in connect
    timeout=min(intermediate_cap, time_left()),
  File "/opt/conda/anaconda/lib/python3.7/asyncio/tasks.py", line 449, in wait_for
    raise futures.TimeoutError()
concurrent.futures._base.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/anaconda/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
    await self.start()
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/nanny.py", line 333, in start
    msg = await self.scheduler.register_nanny()
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 860, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 1048, in connect
    raise exc
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 1032, in connect
    comm = await fut
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/comm/core.py", line 310, in connect
    ) from active_exception
OSError: Timed out trying to connect to tcp://127.0.0.1:33441 after 30 s
ERROR:asyncio:Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /opt/conda/anaconda/lib/python3.7/asyncio/tasks.py:623> exception=OSError('Timed out trying to connect to tcp://127.0.0.1:33441 after 30 s')>
Traceback (most recent call last):
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/comm/core.py", line 286, in connect
    timeout=min(intermediate_cap, time_left()),
  File "/opt/conda/anaconda/lib/python3.7/asyncio/tasks.py", line 449, in wait_for
    raise futures.TimeoutError()
concurrent.futures._base.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/anaconda/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
    await self.start()
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/nanny.py", line 333, in start
    msg = await self.scheduler.register_nanny()
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 860, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 1048, in connect
    raise exc
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 1032, in connect
    comm = await fut
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/comm/core.py", line 310, in connect
    ) from active_exception
OSError: Timed out trying to connect to tcp://127.0.0.1:33441 after 30 s
ERROR:asyncio:Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /opt/conda/anaconda/lib/python3.7/asyncio/tasks.py:623> exception=OSError('Timed out trying to connect to tcp://127.0.0.1:33441 after 30 s')>
Traceback (most recent call last):
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/comm/core.py", line 286, in connect
    timeout=min(intermediate_cap, time_left()),
  File "/opt/conda/anaconda/lib/python3.7/asyncio/tasks.py", line 449, in wait_for
    raise futures.TimeoutError()
concurrent.futures._base.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/anaconda/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
    await self.start()
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/nanny.py", line 333, in start
    msg = await self.scheduler.register_nanny()
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 860, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 1048, in connect
    raise exc
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 1032, in connect
    comm = await fut
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/comm/core.py", line 310, in connect
    ) from active_exception
OSError: Timed out trying to connect to tcp://127.0.0.1:33441 after 30 s
ERROR:asyncio:Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /opt/conda/anaconda/lib/python3.7/asyncio/tasks.py:623> exception=OSError('Timed out trying to connect to tcp://127.0.0.1:33441 after 30 s')>
Traceback (most recent call last):
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/comm/core.py", line 286, in connect
    timeout=min(intermediate_cap, time_left()),
  File "/opt/conda/anaconda/lib/python3.7/asyncio/tasks.py", line 449, in wait_for
    raise futures.TimeoutError()
concurrent.futures._base.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/anaconda/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
    await self.start()
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/nanny.py", line 333, in start
    msg = await self.scheduler.register_nanny()
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 860, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 1048, in connect
    raise exc
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 1032, in connect
    comm = await fut
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/comm/core.py", line 310, in connect
    ) from active_exception
OSError: Timed out trying to connect to tcp://127.0.0.1:33441 after 30 s
ERROR:asyncio:Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /opt/conda/anaconda/lib/python3.7/asyncio/tasks.py:623> exception=OSError('Timed out trying to connect to tcp://127.0.0.1:33441 after 30 s')>
Traceback (most recent call last):
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/comm/core.py", line 286, in connect
    timeout=min(intermediate_cap, time_left()),
  File "/opt/conda/anaconda/lib/python3.7/asyncio/tasks.py", line 449, in wait_for
    raise futures.TimeoutError()
concurrent.futures._base.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/anaconda/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
    await self.start()
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/nanny.py", line 333, in start
    msg = await self.scheduler.register_nanny()
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 860, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 1048, in connect
    raise exc
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 1032, in connect
    comm = await fut
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/comm/core.py", line 310, in connect
    ) from active_exception
OSError: Timed out trying to connect to tcp://127.0.0.1:33441 after 30 s
ERROR:asyncio:Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /opt/conda/anaconda/lib/python3.7/asyncio/tasks.py:623> exception=OSError('Timed out trying to connect to tcp://127.0.0.1:33441 after 30 s')>
Traceback (most recent call last):
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/comm/core.py", line 286, in connect
    timeout=min(intermediate_cap, time_left()),
  File "/opt/conda/anaconda/lib/python3.7/asyncio/tasks.py", line 449, in wait_for
    raise futures.TimeoutError()
concurrent.futures._base.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/anaconda/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
    await self.start()
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/nanny.py", line 333, in start
    msg = await self.scheduler.register_nanny()
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 860, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 1048, in connect
    raise exc
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 1032, in connect
    comm = await fut
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/comm/core.py", line 310, in connect
    ) from active_exception
OSError: Timed out trying to connect to tcp://127.0.0.1:33441 after 30 s
ERROR:asyncio:Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /opt/conda/anaconda/lib/python3.7/asyncio/tasks.py:623> exception=OSError('Timed out trying to connect to tcp://127.0.0.1:33441 after 30 s')>
Traceback (most recent call last):
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/comm/core.py", line 286, in connect
    timeout=min(intermediate_cap, time_left()),
  File "/opt/conda/anaconda/lib/python3.7/asyncio/tasks.py", line 449, in wait_for
    raise futures.TimeoutError()
concurrent.futures._base.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/anaconda/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
    await self.start()
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/nanny.py", line 333, in start
    msg = await self.scheduler.register_nanny()
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 860, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 1048, in connect
    raise exc
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 1032, in connect
    comm = await fut
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/comm/core.py", line 310, in connect
    ) from active_exception
OSError: Timed out trying to connect to tcp://127.0.0.1:33441 after 30 s
ERROR:asyncio:Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /opt/conda/anaconda/lib/python3.7/asyncio/tasks.py:623> exception=OSError('Timed out trying to connect to tcp://127.0.0.1:33441 after 30 s')>
Traceback (most recent call last):
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/comm/core.py", line 286, in connect
    timeout=min(intermediate_cap, time_left()),
  File "/opt/conda/anaconda/lib/python3.7/asyncio/tasks.py", line 449, in wait_for
    raise futures.TimeoutError()
concurrent.futures._base.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/anaconda/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
    await self.start()
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/nanny.py", line 333, in start
    msg = await self.scheduler.register_nanny()
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 860, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 1048, in connect
    raise exc
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/core.py", line 1032, in connect
    comm = await fut
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/comm/core.py", line 310, in connect
    ) from active_exception
OSError: Timed out trying to connect to tcp://127.0.0.1:33441 after 30 s
ERROR:asyncio:Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /opt/conda/anaconda/lib/python3.7/asyncio/tasks.py:623> exception=OSError('Timed out trying to connect to tcp://127.0.0.1:33441 after 30 s')>
Traceback (most recent call last):
  File "/opt/conda/anaconda/lib/python3.7/site-packages/distributed/comm/core.py", line 286, in connect
    timeout=min(intermediate_cap, time_left()),
  File "/opt/conda/anaconda/lib/python3.7/asyncio/tasks.py", line 449, in wait_for
    raise futures.TimeoutError()
concurrent.futures._base.TimeoutError

The above exception was the direct cause of the following exception:


RuntimeError: can't start new thread

Dask is lazy, so if you only ask for the head of the result, it computes only the partitions it needs, probably just one, which is a small computation. The problem most likely appears when you launch the full computation: something is too heavy for your cluster to handle.

It seems that your workers are overloaded and have stopped responding. I would first try without the repartitioning at the end of the computation (is it really needed?), then try training a smaller model, and finally see how it goes with less data if possible.

It looks like you are using very old versions of dask and xgboost. Is there a reason for that? Would you be able to update both to the most recent versions and see if performance improves?

I was able to update the packages (xgboost/dask), but it was still slow. Looking at the worker error logs, there were errors similar to those noted above.

UPDATE: What finally seemed to fix the issue was upgrading the master and worker machines to have more memory. It took some time, but to_parquet did execute!

Well, looking at your code here:

predictions_df = predictions_df.repartition(npartitions=1)
predictions_df['hh_id'] = scoring_df.thd_hh_id.repartition(npartitions=1)

Things take a long time and a lot of memory because you end up with only 1 partition. In that case only one of the workers does the work, and you get no parallelism when writing with to_parquet. Is there a reason why you are not writing to multiple parquet files?

You can repartition based on memory by passing partition_size="100MB" (a rule of thumb) and then write with to_parquet.