Basic question about Dask AWS cloudprovider and scheduled/routine processing

Hi,

Assuming I want to perform routine processing on Fargate each month using the Dask Cloudprovider AWS API, specifically the FargateCluster, what is the best way to schedule the cluster deployment and the tasks? Currently the cluster and client are created by a Python script on my local machine, and the tasks ultimately write the processed data to an S3 bucket.

For example, do I need to create a Lambda function that launches an EC2 instance to run the code that creates the FargateCluster and client, an operation currently done on my local machine? That idea seems to run counter to the benefits of Fargate, which lie in the serverless approach.

A very basic example:

from distributed import Client
from dask_cloudprovider.aws import FargateCluster

cluster = FargateCluster(
    region_name="us-east-1",
    aws_access_key_id="xxx",
    aws_secret_access_key="xxx",
    image="daskdev/dask:2024.3.1-py3.12",
    n_workers=1,
)

cluster.adapt(minimum=0, maximum=1)

# Could also be specified in the Docker container
def do_work(n):
    return n + 1

client = Client(cluster)

rs = client.gather(client.map(do_work, [1]))

print(rs)

client.close()
cluster.close()

Hi @beder101, welcome to the Dask Discourse forum,

I’m under the impression that your question is more about AWS than Dask. In short, you want a cron-based approach that launches your script. If Lambda alone is not enough (does your code run for more than 15 minutes?), you can probably find other approaches, but it would be better to ask on an AWS forum.
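
Very roughly, the Lambda route could look something like the sketch below, with an EventBridge (CloudWatch Events) cron rule invoking the function each month, so no EC2 instance is involved. This is only a sketch, assuming the whole run fits within Lambda's 15-minute limit and the execution role has the ECS/IAM permissions that dask-cloudprovider needs; the handler and the result handling are placeholders.

# Rough sketch of the Lambda approach: an EventBridge cron rule triggers this
# handler monthly. Assumes the run fits Lambda's 15-minute limit and the
# execution role carries the permissions dask-cloudprovider needs.
from distributed import Client
from dask_cloudprovider.aws import FargateCluster

def do_work(n):
    return n + 1

def handler(event, context):
    # Credentials come from the Lambda execution role, so no keys are hard-coded.
    cluster = FargateCluster(
        region_name="us-east-1",
        image="daskdev/dask:2024.3.1-py3.12",
        n_workers=1,
    )
    try:
        with Client(cluster) as client:
            results = client.gather(client.map(do_work, [1]))
            # ... write `results` to your S3 bucket here ...
            return {"results": results}
    finally:
        cluster.close()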

I’ve googled a bit and found a few resources:

I’m not sure how relevant they are. Maybe others here like @jacobtomlinson have more experience than me on this subject.

You might also be interested in some kind of workflow manager like Prefect.
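
For example, here is a minimal sketch of a scheduled Prefect flow wrapping your existing script, assuming Prefect 2.10+ where flow.serve accepts a cron expression; the flow name and schedule are placeholders.

# Minimal Prefect sketch: the flow body is essentially the original script, and
# Prefect's serve API triggers it on a monthly cron schedule.
from prefect import flow
from distributed import Client
from dask_cloudprovider.aws import FargateCluster

def do_work(n):
    return n + 1

@flow(log_prints=True)
def monthly_processing():
    # Same cluster/client logic as the script in the original post.
    with FargateCluster(
        region_name="us-east-1",
        image="daskdev/dask:2024.3.1-py3.12",
        n_workers=1,
    ) as cluster, Client(cluster) as client:
        results = client.gather(client.map(do_work, [1]))
        print(results)  # ... or write the results to S3 ...

if __name__ == "__main__":
    # Keeps a long-running process alive that triggers the flow at 06:00 UTC on
    # the 1st of each month; Prefect Cloud/server deployments are another option.
    monthly_processing.serve(name="monthly-fargate-processing", cron="0 6 1 * *")

A workflow manager like this also gives you retries, logging, and a UI around the monthly run, which is harder to get from a bare cron rule.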