In my data analysis experiments I am using the Dask library with the distributed scheduler and futures. The work involves large-scale data loading and preprocessing, and building a Neo4j graph database on a single Google Cloud instance. After a few hours with all my workers running at full load, they go idle, except for one worker that runs at 100% of its CPU capacity. Then, after about 30 minutes of idling, some workers start working for a minute or so and return to an idle state again. Why are the other workers idle? Are they doing garbage collection cycles? Are they waiting for dependencies to finish? I am new to Dask and cannot figure out what is happening. I mostly use futures in my Python code. Can anyone give me a hint as to what is happening? Thanks in advance for your time.
@kostas_vavouris Welcome to Discourse and thanks for your question!
That’s odd indeed. We’d need more details about your Dask workflow to diagnose this because a lot of things can cause this behavior. Would you be able to share a minimal, reproducible example as well as the Dask+Python version you’re running?
Generally, though, some of the topics in these best practices may help, especially using the dashboard to monitor worker memory.
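If the dashboard is hard to reach, roughly the same per-worker numbers can be read from the client instead. Here is a minimal sketch, assuming `info` is the dict returned by `Client.scheduler_info()`. The key names used below (`memory_limit`, and `memory`, `cpu`, `executing`, `ready` under `metrics`) are assumptions about that dict's layout and may differ between `distributed` versions, and the sample data is hand-written for illustration, not real output.

```python
def summarize_workers(info):
    """Return one summary row per worker from a scheduler_info()-style dict.

    Assumed layout (may differ between distributed versions):
    info["workers"][address] -> {"memory_limit": ..., "metrics": {...}}
    """
    rows = []
    for address, worker in sorted(info.get("workers", {}).items()):
        metrics = worker.get("metrics", {})
        used = metrics.get("memory", 0)
        limit = worker.get("memory_limit") or 0
        rows.append({
            "address": address,
            "executing": metrics.get("executing", 0),
            "ready": metrics.get("ready", 0),
            "cpu_percent": metrics.get("cpu", 0.0),
            # Fraction of the worker's memory limit currently in use.
            "memory_fraction": round(used / limit, 2) if limit else None,
        })
    return rows


# Hand-written sample standing in for client.scheduler_info().
sample = {
    "workers": {
        "tcp://10.0.0.2:40001": {
            "memory_limit": 16_000_000_000,
            "metrics": {"memory": 12_000_000_000, "cpu": 100.0,
                        "executing": 2, "ready": 40},
        },
        "tcp://10.0.0.3:40002": {
            "memory_limit": 16_000_000_000,
            "metrics": {"memory": 4_000_000_000, "cpu": 2.0,
                        "executing": 0, "ready": 0},
        },
    }
}

rows = summarize_workers(sample)
for row in rows:
    print(row)
```

With a live cluster you would call `summarize_workers(client.scheduler_info())` every few minutes. One worker with a long `ready` queue while the others show nothing executing would suggest the work is unevenly distributed rather than that the workers are out of memory.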
Hi there,
thank you for your reply. Dask version: 2022.02.0, Python version: 3.7.12. I cannot share a reproducible example because the whole experiment is very complex, but I will try to explain it in more detail. The experiment involves large volumes of JSON Lines data compressed in lz4 format. The data is stored in a Google Cloud bucket and accessed remotely through Dask. Specifically, a Dask bag loads the data, filters it, extracts the features of interest, homogenizes them, cleans them up, and converts them to dataframes. A Python client application then uses the Neo4j Python driver to insert the dataframes, in large batches and through transactions, into a Neo4j Community Edition graph database server. The server is installed on a Google Cloud virtual machine running a Debian-like Linux operating system, so the inserts are done remotely as well. For the first 2-3 hours everything works fine and the workers rapidly insert data into the Neo4j database as nodes, relationships, properties, etc. After that, however, the workers fall idle one after another, although they still have jobs to do. I cannot figure out what is happening because I am new to Dask. I use the dask.distributed scheduler with a cluster of 8 workers, 2 threads per worker, and approximately 16 GB of RAM per worker. I have tried both Dask futures and delayed, but the outcome is unfortunately the same: most workers are idling while they are supposed to be doing work.
To make things worse, I cannot use the Dask dashboard, because the client Python application runs in a JupyterLab notebook on another Google Cloud virtual machine with 16 vCPUs and 128 GB of RAM. I have not managed to get the dashboard to load at the link http://localhost:8787/status produced by the client object. I have tried everything I could think of, such as opening port 8787 through firewall rules and forwarding port 8787 to another port such as 8000, but nothing worked.
The Dask dashboard refuses to show. I have installed bokeh version 2.4.2. Something in the Google Cloud JupyterLab environment is blocking the dashboard from showing its real-time statistics, so I do not have access to the valuable insights it offers. I am desperate because my project has now been delayed for 2 months. I have some ideas as to what causes the workers to idle: 1) Is it possible that the task graphs being created are extremely large and the scheduler cannot keep up? 2) Is memory insufficient, causing the workers to idle? 3) Are the workers idling while they wait for data to be read back from disk, where it was stored because worker memory is not enough? Any other suggestions? Can anyone contribute their technical expertise? Thanks again for the time you took to help me!
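On idea 1: one way to check whether an oversized task graph is the problem is to cap how many tasks are submitted at any one time, rather than submitting everything up front. Here is a minimal sketch using only the standard `concurrent.futures` interface, demonstrated with a `ThreadPoolExecutor`. The names `run_bounded`, `insert_batch`, and `max_in_flight` are hypothetical and introduced only for illustration, and `insert_batch` is a stand-in for the real "load one file and write it to Neo4j" step.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_bounded(executor, func, items, max_in_flight=16):
    """Apply func to each item with at most max_in_flight tasks outstanding.

    Results are collected in completion order, not submission order.
    """
    pending = set()
    results = []
    for item in items:
        if len(pending) >= max_in_flight:
            # Block until at least one outstanding task finishes.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            results.extend(f.result() for f in done)
        pending.add(executor.submit(func, item))
    # Drain the tasks that are still outstanding.
    done, _ = wait(pending)
    results.extend(f.result() for f in done)
    return results


def insert_batch(batch_id):
    # Hypothetical stand-in for loading one batch and writing it to Neo4j.
    return batch_id * 2


with ThreadPoolExecutor(max_workers=4) as pool:
    out = run_bounded(pool, insert_batch, range(100), max_in_flight=8)

print(len(out), sum(out))
```

With Dask, `Client.get_executor()` should return an executor that follows the same `concurrent.futures` interface, so `run_bounded(client.get_executor(), ...)` is one option to try. I have not tested that against 2022.02.0, so treat it as an assumption. If the idling disappears with a small `max_in_flight`, that points at the scheduler or at memory pressure from too many results being held at once.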