Dask.read_sql_table() too slow

Hi Team,

I am working on a project with a table of 85 million records and 25 columns, and the task is to load it with read_sql_table() using partitioning. However, this call alone takes 90 minutes to return, and I am not calling the .compute() method yet.
Any help here would be appreciated.

df = dt.read_sql_table(table_name="*", con=s, index_col='')  # <- this takes more than 90 min

How can I distribute this across the workers, so that when I call .compute() I can compute only a single partition rather than the whole DataFrame?

Hi @mohitbansal-gep,

This line isn’t loading any data. It only reads a few rows and collects the max/min index values from the database in order to infer the number of partitions to create when building the Dask DataFrame.

It is strange that it takes so much time; maybe your database is huge and slow to respond even to simple queries. You should look at the other kwargs you can pass to optimize this. See the corresponding documentation:

If neither divisions or npartitions is given, the memory footprint of the first few rows will be determined, and partitions of size ~256MB will be used.

Once the divisions are known or computed, when calling compute, the work will automatically be distributed among Workers.
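For what it’s worth, here is a minimal sketch of passing explicit divisions so that Dask can skip the min/max queries on the index column (the table name, connection string, and index column below are placeholder assumptions, not your actual schema):

import dask.dataframe as dd

con = "mssql+pyodbc:///?odbc_connect=..."  # placeholder connection string

# With explicit divisions, read_sql_table does not need to query the database
# for the index min/max; only a small head query remains to infer column types.
df = dd.read_sql_table(
    "ThisIsSomeTable",        # assumed table name
    con,
    index_col="Id",           # assumed indexed, numeric column
    divisions=list(range(0, 85_000_001, 1_000_000)),  # assumes an integer index spanning roughly 0..85M
)

Passing meta as well should remove the head query entirely, at the cost of describing all 25 columns yourself.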

Yes, the database is huge: 60 GB of data.
So the idea is to preload the Dask DataFrame into cluster memory and then query the DataFrame as required, via df.partitions[n].compute().
Can you help?
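For reference, a minimal sketch of that pattern (the client variable, cluster address, and DataFrame names are placeholders, not taken from your code):

from dask.distributed import Client, wait

client = Client("Remote Cluster Address")   # placeholder scheduler address

df = client.persist(df)   # returns immediately; partitions load in the background
wait(df)                  # block until every partition is actually in cluster memory

# Once persisted, computing a single partition should not re-run the SQL query;
# it just fetches that partition from the workers.
pdf = df.partitions[0].compute()   # a plain pandas DataFrame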

Also, sometimes I am getting this warning:
" Event loop was unresponsive in Worker for 19.30s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability."

Well, it’s difficult to help much more here without knowing what the table you’re trying to read looks like. You need to give Dask some hints on how to read this data.

Also, one weird thing I notice in your code is the kwarg table_name="*".

Are you not trying to read only one table?

clnt = Client('Remote Cluster Address')
stringconnect = 'mssql+pyodbc:///?odbc_connect=%s' % params
df = dt.read_sql(sql=s1, con=stringconnect, index_col=orderBy, bytes_per_chunk="20 MiB")  # <- this takes 90 min
df = clnt.persist(df)  # <- this spawns 21 worker pods
clnt.publish_dataset(df, name='test')

# Now from the other notebook
df = clnt.get_dataset('test')
df.npartitions  # <- this prints 15059
df.partitions[0].compute()  # <- expected a pandas DataFrame, but this takes forever (more than 100 min) with no result

We have deployed Dask clusters through Helm.

Please ignore the typos; the actual call uses table_name="ThisIsSomeTable".

Also, while trying to create the Client in the above code from another notebook, we get an OSError timeout connecting to the Traefik gateway.

It’s really hard to tell what the problem could be. Are you able to issue simple SQL commands with a decent delay from the same server as the one you’re using to create the Dask Client?

Yes, I am able to fetch the top 1000 rows in about 30 seconds.

@guillaumeeb My issue is that even after the 90 minutes spent on read_sql_table, persist, and publish, .compute() shouldn’t take this long. I am assuming .compute() is firing the SQL query again, which it shouldn’t, since I have already called persist. So what am I missing here?

Is that the same thing for a count?

You are right: if the persist call has really ended, compute should only take the time to fetch the results. Did you have a look at the Dask Dashboard to get more insight into what was happening?
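For example, here is a minimal sketch of how to check that, reusing the clnt and df names from the earlier snippet (these exact calls are an assumption about your setup, not something confirmed in this thread):

from dask.distributed import progress, wait

df = clnt.persist(df)   # persist returns immediately; loading happens in the background
progress(df)            # shows a progress bar in the notebook while partitions load
wait(df)                # or simply block until everything is really in memory

# A cheap sanity check, which also covers the count question: once the data is
# persisted, this only touches the partitions in memory, not the database.
n_rows = df.map_partitions(len).sum().compute()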