Distribute large volume of data processing

senthil-bungee · January 30, 2022, 4:49am

Hello,

We have around 750 million pair products in s3. We want to go through ML model for each pair and generate its matching score. We are using specialized EC2 instances with inferencia chips. Hence our model is precompiled to load into these type of machines alone. Now the question is

Looked at cloud provider API, which creates EC2 clusters with scheduler and workers. Is it possible to configure scheduler to run in different type instances and specialized instances for workers?
We would like to load the model into workers before any processing. Should I use workers command to load them?
We are thinking to write Athena query using aws wrangler to get them as dataframe (around 5 million product pairs) iterators and feed them into a method that process each dataframe in sequentially. So, we will write code to fetch dataframe on the dask client and submit a dataframe to EC2 cluster. Is this right approach or any other better appraoch available? Now, client is holding responsible of holding logic to chuck and process them. Hence I need to maintain journal to keep track of submitting dataframes. I need to take care of client failure to restart client from the place where it left. Is it correct approach? What will happen if during one of dataframe processing, scheduler fails due to some network issue or so?

Please provide some insights to design this requirement.

pavithraes · January 31, 2022, 6:36pm

@senthil-bungee Hi and welcome to Discourse! Thanks for following up here from this Stack Overflow question!

Looked at cloud provider API, which creates EC2 clusters with scheduler and workers. Is it possible to configure scheduler to run in different type instances and specialized instances for workers?

Maybe @jacobtomlinson can help confirm?

We would like to load the model into workers before any processing. Should I use workers command to load them?

We are thinking to write Athena query using aws wrangler to get them as dataframe (around 5 million product pairs) iterators and feed them into a method that process each dataframe in sequentially. So, we will write code to fetch dataframe on the dask client and submit a dataframe to EC2 cluster. Is this right approach or any other better appraoch available? Now, client is holding responsible of holding logic to chuck and process them. Hence I need to maintain journal to keep track of submitting dataframes. I need to take care of client failure to restart client from the place where it left. Is it correct approach? What will happen if during one of dataframe processing, scheduler fails due to some network issue or so?

A minimal, reproducible (if possible) example might help here. For example, where is the model stored before loading? Even some sample functions to illustrate the workflow will help a lot.

It may also be worth opening new topics for both of these questions, to allow for better and more dedicated discussions.

jacobtomlinson · February 21, 2022, 5:13pm

To answer 1 and 2.

The EC2Cluster class takes an instance_type kwarg to configure your instances.
Calling Client.run can be helpful for executing a function on all workers. So for loading a model this could be useful.

Topic		Replies	Views
Dask AWS ECS Cluster VS AWS Ec2 instance + Local Cluster Distributed	16	1012	March 8, 2023
Basic question about Dask AWS cloudprovider and scheduled/routine processing dask-cloudprovider	2	96	April 3, 2024
Mounting a folder in Dask distributed AWS ECS/EC2 cluster Distributed	9	373	March 6, 2023
Uploading data files to the EC2Cluster Distributed distributed	3	117	February 15, 2024
FileNotFoundError when reading input in distributed cluster	2	198	February 24, 2025

Distribute large volume of data processing

Related topics