Install dependencies on EC2Cluster

When I start an EC2Cluster with

from dask_cloudprovider.aws import EC2Cluster

cluster = EC2Cluster(
    region="us-east-1",
    # the Name is the IAM role name
    iam_instance_profile={"Name": "dask-cluster-ec2-role"},
    n_workers=4,
    worker_instance_type="t2.medium",
    scheduler_instance_type="t2.micro",
    security=False,
)

and later try to create a Dask DataFrame from S3, I get an error message when I run the compute() method saying that s3fs is not installed. Is there a way to install this via an argument to EC2Cluster? If not, what are the alternatives?
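
For illustration, the failing pattern looks roughly like this (the bucket path is just a placeholder):

import dask.dataframe as dd

# Reading from S3 needs s3fs on the scheduler and workers,
# not only on the client machine.
df = dd.read_csv("s3://my-bucket/data/*.csv")  # hypothetical bucket
df.compute()  # fails: "s3fs is not installed"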

Hi @Sacha_Van_Weeren, welcome to Dask Discourse forum,

I think the easiest solution would be to pass some environment variables to the Docker images, typically by adding a kwarg:

cluster = EC2Cluster(
    region="us-east-1",
    # the Name is the IAM role name
    iam_instance_profile={"Name": "dask-cluster-ec2-role"},
    n_workers=4,
    worker_instance_type="t2.medium",
    scheduler_instance_type="t2.micro",
    security=False,
    # EXTRA_CONDA_PACKAGES is read by the default daskdev/dask image at startup
    env_vars={"EXTRA_CONDA_PACKAGES": "s3fs"},
)
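
Once the cluster is up, a quick way to check that the package actually landed on the workers (a sketch, assuming a client connected to the cluster):

from distributed import Client

client = Client(cluster)

def has_s3fs():
    import importlib.util
    return importlib.util.find_spec("s3fs") is not None

# run the check on every worker; returns {worker_address: bool}
print(client.run(has_s3fs))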

Thanks a lot for your response. However, when I run the above command, the cluster creation seems to hang. I tried it twice with the env_vars and once without. Without them the creation continues; with them, it gets stuck.

Maybe I didn’t give the correct syntax. Do you have some output logs from the container start?
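
If the cluster object is still reachable, something like this should pull whatever log output exists so far (a sketch; get_logs() comes from the base Dask cluster class, so it may not return anything useful while startup is still hanging):

# fetch scheduler and worker logs collected so far
logs = cluster.get_logs()
for name, log in logs.items():
    print(f"=== {name} ===")
    print(log)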

Another method, a bit more complex, would be to build your own Docker image based on the Dask one with the added dependencies. Or you could use existing ones like the Pangeo docker images.
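
A sketch of that approach, assuming you have built and pushed an image of your own (e.g. from a Dockerfile that starts FROM daskdev/dask and adds RUN pip install s3fs); the image name below is a placeholder:

from dask_cloudprovider.aws import EC2Cluster

cluster = EC2Cluster(
    region="us-east-1",
    iam_instance_profile={"Name": "dask-cluster-ec2-role"},
    n_workers=4,
    worker_instance_type="t2.medium",
    scheduler_instance_type="t2.micro",
    security=False,
    # hypothetical image built on top of daskdev/dask with s3fs preinstalled
    docker_image="my-registry/dask-s3fs:latest",
)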

cc @jacobtomlinson