I am trying to build an ETL application. Whenever I try to migrate huge amounts of data from a MongoDB database to a MySQL database, the application server's capacity is not enough to do it. Will the Dask framework help me with this?
Hi @RamaSubramanian, welcome here.
Your workflow description lacks a bit of information. What is your application server? What are you currently doing to migrate this data? Which capacity of the server is not enough?
Dask will help you parallelize, or even distribute, the process across several servers. But will MySQL or even Mongo be able to handle a big load of queries?
Hi @guillaumeeb , Thanks for the response.
We have built a data warehouse (MySQL) on our in-house Ubuntu server. For migrating data from various sources (CSV, XML, REST APIs, RDBMS, NoSQL) to our DWH, I have been using core Python scripts (an ETL process) for the last few months. But whenever I try to migrate huge amounts of data to the DWH, the server cannot handle the load.
My question is: will Dask (with Prefect) help me do this bulk-load process by breaking it into some kind of micro-processes?
What is the limitation you are facing? CPU, memory, or disk?
Dask will help you write and parallelize your workflow to launch several processes at the same time, or split your workload into small pieces. Is that what you want?
The limitation depends on the data size. For example, if I want to migrate 1 GB of data in 10 minutes for a specific table (collection), my server runs out of memory and CPU, which terminates the process and leads to data loss.
As you mentioned, Dask splits the process into sub-processes and does the required task. I need some examples or documentation so I can learn about it and implement it.
1 GB of data is quite small; it means you would only have to query really small amounts of data from Mongo, piece by piece.
It's still hard to help you further without a code sample showing how you query Mongo and feed the data into MySQL. But from here, I'm under the impression that Dask won't be magical: you might need to manually help it split your input data into pieces.
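To illustrate, here is a minimal sketch of the chunking idea, assuming only that the `dask` package is installed. `read_chunk` and `write_chunk` are hypothetical placeholders: in a real migration they would wrap a pymongo `find().skip().limit()` query and a MySQL `executemany` insert, so that no single task ever holds the full dataset in memory.

```python
# Sketch: migrate a collection in fixed-size chunks with dask.delayed,
# so each task reads and writes only one small piece of the data.
import dask
from dask import delayed

CHUNK_SIZE = 10_000

def read_chunk(skip, limit):
    # Placeholder for: collection.find().skip(skip).limit(limit)
    return [{"_id": i} for i in range(skip, skip + limit)]

def write_chunk(rows):
    # Placeholder for: cursor.executemany("INSERT INTO ...", rows)
    return len(rows)

total_docs = 50_000  # placeholder for collection.count_documents({})

# Build a lazy task graph: one read + write pair per chunk.
tasks = [
    delayed(write_chunk)(delayed(read_chunk)(skip, CHUNK_SIZE))
    for skip in range(0, total_docs, CHUNK_SIZE)
]

# Execute the chunks in parallel; the scheduler choice is swappable.
written = sum(dask.compute(*tasks, scheduler="threads"))
print(written)  # 50000
```

The same graph could later run with `scheduler="processes"`, or on a distributed cluster, without changing the task definitions.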
Thanks. I will share my script soon with dask implementation.
Another question: can you tell me the difference between using multithreading in Python and using Dask in Python? Will they work similarly or not?
Dask can do multi-threading, multi-processing, and distributed processing across several servers, all with the same APIs. Dask also offers high-level APIs such as DataFrames and Arrays, and low-level ones such as Futures and Delayed. Using Dask's multithreaded scheduler with the Futures API is pretty much like using plain multithreading. Dask will really help in your case if you take advantage of its APIs.
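As a small comparison sketch (assuming only the `dask` package is installed), the same toy task run with the stdlib `ThreadPoolExecutor` and with Dask's threaded scheduler gives identical results; the difference is that the Dask graph can later be executed with a different scheduler without rewriting the code.

```python
# Compare plain Python multithreading with Dask's threaded scheduler
# on the same trivial transform function.
from concurrent.futures import ThreadPoolExecutor

import dask
from dask import delayed

def transform(x):
    # Stand-in for a real per-record ETL transformation.
    return x * 2

data = list(range(8))

# 1) Plain Python multithreading: eager execution on a thread pool.
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(transform, data))

# 2) Dask: build a lazy graph, then choose the scheduler at compute time.
lazy = [delayed(transform)(x) for x in data]
dasked = list(dask.compute(*lazy, scheduler="threads"))

assert threaded == dasked  # same results, interchangeable execution
```

Swapping `scheduler="threads"` for `"processes"` or pointing the computation at a `dask.distributed` cluster is a one-line change, which is the main practical difference from hand-rolled threading.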
I would recommend reading (again) the first pages of the Dask documentation to get a better understanding of what it could do for you: