We've found that the to_sql method takes a long time to execute.
We believe the time is spent in the compute step, which to_sql triggers because it transforms everything back into a pandas DataFrame, am I right? According to the documentation, this is needed to load the data into SQL: dask.dataframe.to_sql — Dask documentation
But we believe there should be a better way: is there a way to move the data directly from a Dask DataFrame to SQL without computing?
@ErikvdVen Welcome to Discourse!
> But we believe there should be a better way: is there a way to move the data directly from a Dask DataFrame to SQL without computing?
A Dask DataFrame consists of multiple pandas DataFrames, and operations usually happen in parallel on these pandas DataFrames. It's important to note here that Dask isn't really aware of the values in your DataFrame; it just knows the overall structure. So, any operation you do needs to happen on the underlying pandas DataFrames (that's what we mean by "compute"). Does that make sense?
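Here's a minimal sketch of that idea, using a made-up DataFrame just to illustrate the lazy-graph-plus-compute workflow:

```python
import pandas as pd
import dask.dataframe as dd

# A small, made-up pandas DataFrame, split into 4 partitions by Dask
pdf = pd.DataFrame({"x": range(100), "y": range(100)})
ddf = dd.from_pandas(pdf, npartitions=4)

# Operations like this are lazy -- they only build a task graph
result = ddf[ddf.x > 50].y.mean()

# compute() runs the graph on the underlying pandas partitions
# and returns a concrete, in-memory result
print(result.compute())
```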
But, by default, DataFrame.to_sql doesn't operate on these underlying pandas DataFrames in parallel (because it may alter the ordering of rows in your SQL table). So, it's basically working in serial mode here, and that's why it is taking some time. If row order isn't an issue, you can set the parameter parallel=True, which will help speed things up.
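For example, a rough sketch of what that could look like (the file path, table name, and connection URI here are placeholders for your own setup):

```python
import dask.dataframe as dd

ddf = dd.read_csv("data/*.csv")  # placeholder path

# By default, partitions are written one after another to preserve
# row order; parallel=True lets the workers write their partitions
# concurrently, so row order in the table is no longer guaranteed.
ddf.to_sql(
    "my_table",                                # placeholder table name
    uri="postgresql://user:password@host/db",  # placeholder URI
    if_exists="replace",
    parallel=True,
)
```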
The only other way I can think of is having more memory, so that you can operate on larger partitions.
Let me know if this helps!
Hi @pavithraes !
Thank you for your detailed answer! Your explanation about "compute" makes complete sense!
I will try setting the parameter parallel to True. However, isn't there any possibility to let the distributed client save everything to SQL? For example, we currently use client.compute(dd.read_csv(...)) to let the cluster read all the CSV files and load them as DataFrames, but is it possible to let the cluster save those DataFrames to SQL as well?
At the moment the process is killed constantly when I try to run client.compute(dd.to_sql(...)), while reading works fine, so I wonder if it is even possible.
@ErikvdVen Glad to help, and thanks for your question!
> However, isn't there any possibility to let the distributed client save everything to SQL? For example, we currently use client.compute(dd.read_csv(...)) to let the cluster read all the CSV files and load them as DataFrames, but is it possible to let the cluster save those DataFrames to SQL as well?
I'm not sure that I completely understand what you mean, but I'd assume you can't do this directly and need to use to_sql() here. I also feel like this doesn't align with what Dask was designed for. We usually try to minimize calling compute. Ideally, you can do all your operations (i.e., build the task graphs) and call compute at the very end to get the final results.
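As a rough illustration of that pattern (the file path and column names below are made up):

```python
import dask.dataframe as dd

# All of these steps are lazy -- they only extend the task graph
ddf = dd.read_csv("data/*.csv")            # placeholder path
ddf = ddf[ddf["amount"] > 0]               # hypothetical filter
ddf["amount_doubled"] = ddf["amount"] * 2  # hypothetical transform

# Work only happens here, once, at the very end
result = ddf.compute()
```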
I'm also curious about your workflow here, because Dask DataFrame may not be the best solution for your use case: Dask Dataframe and SQL — Dask documentation
> At the moment the process is killed constantly when I try to run client.compute(dd.to_sql(...)), while reading works fine, so I wonder if it is even possible.
You don't need to use compute explicitly here because dd.to_sql() calls compute internally. Also, all the previous operations will be executed here (when you call compute), so we can't be sure which line/operation is causing the issue. Would it be possible for you to share a minimal example? It'll allow us to help you better.
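For what it's worth, the pattern I'd expect to work looks roughly like this (the path and URI are placeholders), without wrapping anything in client.compute:

```python
import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # connect to your cluster (or start a local one)

ddf = dd.read_csv("data/*.csv")  # placeholder path

# to_sql triggers the computation itself, and the work runs on the
# cluster the client is connected to
ddf.to_sql(
    "my_table",                                # placeholder table name
    uri="postgresql://user:password@host/db",  # placeholder URI
    if_exists="replace",
)
```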