Distributing a large volume of data processing

Hello,

We have around 750 million product pairs in S3. We want to run each pair through an ML model and generate its matching score. We are using specialized EC2 instances with AWS Inferentia chips, so our model is precompiled to load only on this type of machine. Now the questions are:

  1. We looked at the Dask Cloud Provider API, which creates EC2 clusters with a scheduler and workers. Is it possible to configure the scheduler to run on a different instance type and use the specialized instances only for the workers?
  2. We would like to load the model onto the workers before any processing starts. Should I use a worker command to load it?
  3. We are thinking of writing an Athena query using AWS Wrangler to get the data as an iterator of dataframes (around 5 million product pairs each) and feed them into a method that processes each dataframe sequentially. So we would write code on the Dask client to fetch a dataframe and submit it to the EC2 cluster. Is this the right approach, or is there a better one? With this design, the client is responsible for the logic to chunk and process the data, so I need to maintain a journal to keep track of the dataframes already submitted, and I need to handle client failure so the client can restart from where it left off. Is that correct? And what will happen if the scheduler fails due to a network issue or similar while one of the dataframes is being processed? A rough sketch of what I have in mind is shown after this list.
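A minimal sketch of that workflow, assuming awswrangler's `wr.athena.read_sql_query` with `chunksize` for the iterator (the query, database name, scheduler address, and `score_pairs` function are placeholders, and the journal is only indicated in comments):

```python
import awswrangler as wr
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder scheduler address


def score_pairs(df):
    # Placeholder: run the precompiled matching model over one chunk of
    # product pairs and return their matching scores.
    ...


# Read the Athena results in chunks so we never hold all 750M pairs at once.
df_iter = wr.athena.read_sql_query(
    "SELECT * FROM product_pairs",  # placeholder query
    database="my_database",         # placeholder database
    chunksize=5_000_000,            # ~5 million pairs per dataframe
)

for i, chunk in enumerate(df_iter):
    future = client.submit(score_pairs, chunk)
    result = future.result()
    # Journal the chunk index somewhere durable (e.g. S3 or DynamoDB) so a
    # restarted client can skip chunks that were already processed.
```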

Please provide some insights on how to design for this requirement.

@senthil-bungee Hi and welcome to Discourse! Thanks for following up here from this Stack Overflow question!

  1. We looked at the Dask Cloud Provider API, which creates EC2 clusters with a scheduler and workers. Is it possible to configure the scheduler to run on a different instance type and use the specialized instances only for the workers?

Maybe @jacobtomlinson can help confirm?

  2. We would like to load the model onto the workers before any processing starts. Should I use a worker command to load it?
  3. We are thinking of writing an Athena query using AWS Wrangler to get the data as an iterator of dataframes (around 5 million product pairs each) and feed them into a method that processes each dataframe sequentially. So we would write code on the Dask client to fetch a dataframe and submit it to the EC2 cluster. Is this the right approach, or is there a better one? With this design, the client is responsible for the logic to chunk and process the data, so I need to maintain a journal to keep track of the dataframes already submitted, and I need to handle client failure so the client can restart from where it left off. Is that correct? And what will happen if the scheduler fails due to a network issue or similar while one of the dataframes is being processed?

A minimal, reproducible example (if possible) might help here. For example, where is the model stored before loading? Even some sample functions illustrating the workflow would help a lot.

It may also be worth opening new topics for both of these questions, to allow for better and more dedicated discussions.

To answer 1 and 2:

  1. The EC2Cluster class takes an instance_type kwarg to configure your instances.
  2. Calling Client.run can be helpful for executing a function on all workers, so it could be useful for loading the model. There is a short sketch of both points below.
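A minimal sketch of both points, assuming dask-cloudprovider's EC2Cluster and using an Inferentia instance type such as inf1.xlarge purely for illustration (the instance type, worker count, and model-loading logic are placeholders):

```python
from dask_cloudprovider.aws import EC2Cluster
from dask.distributed import Client

# Point 1: configure the instance type used for the cluster.
cluster = EC2Cluster(
    instance_type="inf1.xlarge",  # placeholder Inferentia instance type
    n_workers=4,                  # placeholder worker count
)
client = Client(cluster)


# Point 2: load the model once on every worker before submitting any work.
def load_model():
    # Placeholder: read the precompiled model from wherever it is stored
    # (e.g. S3) and keep a reference to it on the worker object.
    from dask.distributed import get_worker

    worker = get_worker()
    worker.model = ...  # load the precompiled Inferentia model here
    return "loaded"


# Client.run executes the function on all workers currently in the cluster.
client.run(load_model)
```

Depending on the dask-cloudprovider version, EC2Cluster may also expose separate scheduler_instance_type and worker_instance_type kwargs, which would let the scheduler run on a different instance type than the workers; worth checking against the version you have installed.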