AWS Lambda Image to run dask-cloudprovider[aws-fargate]

Hello !

I want to run an AWS Fargate cluster using dask-cloudprovider from a Docker image deployed to AWS Lambda. I have installed all packages using conda and it works when I run it locally. However, I’d like to run it from an AWS Lambda (it will be executed once a day, so there is no need to keep an EC2 instance running all the time). I created different Dockerfiles, based on different images.
The latest version is adapted from blog-samples/2021-06-Amazonian-Conda at main · BaysC/blog-samples · GitHub.

  1. FROM public.ecr.aws/lambda/python:3.8 => I install miniconda with all the dependencies. Same problem, I can’t find my function.
    ERROR: “[Errno 30] Read-only file system: ‘/var/task/dask-worker-space’”
FROM public.ecr.aws/lambda/python:3.8

RUN yum update && yum install -y wget && yum clean all
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh && sh miniconda.sh -b -p /opt/miniconda
COPY environment.yml /tmp/environment.yml
RUN sed -i -r '/m2w64|vs2015|msys2|win|vc/d' /tmp/environment.yml
RUN /opt/miniconda/bin/conda env create --file /tmp/environment.yml --prefix /opt/conda-env
RUN /opt/conda-env/bin/pip install awslambdaric
RUN mv /var/lang/bin/python3.8 /var/lang/bin/python3.8-orig && ln -sf /opt/conda-env/bin/python /var/lang/bin/python3.8
COPY my_code.py /opt/my-code/my_code.py

ENV AWS_KEY=<KEY> \
    AWS_SECRET=<SECRET>

ENV PYTHONPATH "/var/lang/lib/python3.8/site-packages:/opt/my-code"
ENTRYPOINT ["/lambda-entrypoint.sh"]
CMD ["my_code.lambda_handler"]
  2. So I changed it to install/copy everything into /tmp, since AWS Lambda suggests it. Now the error is: “RequestId: 18… Error: Runtime exited with error: exit status 127” => /var/runtime/bootstrap: line 7: /var/lang/bin/python3.8: No such file or directory

And I have this Dockerfile:

FROM public.ecr.aws/lambda/python:3.8

RUN yum update && yum install -y wget && yum clean all
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh && sh miniconda.sh -b -p /opt/miniconda
COPY environment.yml /tmp/environment.yml
RUN sed -i -r '/m2w64|vs2015|msys2|win|vc/d' /tmp/environment.yml
RUN /opt/miniconda/bin/conda env create --file /tmp/environment.yml --prefix /tmp/opt/conda-env
RUN /tmp/opt/conda-env/bin/pip install awslambdaric
RUN mv /var/lang/bin/python3.8 /var/lang/bin/python3.8-orig && ln -sf /tmp/opt/conda-env/bin/python /var/lang/bin/python3.8
COPY my_code.py /tmp/opt/my-code/my_code.py

ENV AWS_KEY=<KEY> \
    AWS_SECRET=<SECRET>

ENV PYTHONPATH "/var/lang/lib/python3.8/site-packages:/tmp/opt/my-code"
ENTRYPOINT ["/lambda-entrypoint.sh"]
CMD ["my_code.lambda_handler"]

If I run it with a different entrypoint just to check the error, and run /var/lang/bin/python3.8, the file is actually there, so I don’t understand what’s going on.
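A possible explanation worth ruling out (an assumption, not something confirmed in this thread): a "No such file or directory" on a path that visibly exists can mean the path is a symlink whose target is missing. In the second Dockerfile, /var/lang/bin/python3.8 is a symlink into /tmp/opt/conda-env, so if /tmp looks different at invocation time than it did at build time, the link would dangle. A small sketch for checking this from inside the container; `describe_path` is a hypothetical helper name, not part of any library:

```python
import os


def describe_path(path):
    """Report whether `path` exists, is a symlink, and whether its target resolves."""
    return {
        "path": path,
        "is_symlink": os.path.islink(path),
        # lexists() is True even for a dangling symlink
        "link_exists": os.path.lexists(path),
        # exists() follows symlinks, so it is False when the target is missing
        "target_exists": os.path.exists(path),
        "resolved": os.path.realpath(path),
    }


if __name__ == "__main__":
    # Paths taken from the second Dockerfile above
    for p in ("/var/lang/bin/python3.8", "/tmp/opt/conda-env/bin/python"):
        print(describe_path(p))
```

If `link_exists` is True but `target_exists` is False when this runs inside Lambda, the symlink is there and its target is not.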

PS: The my_code.py file creates an AWS Fargate cluster like this:

cluster = FargateCluster(
    scheduler_mem=8192,
    n_workers=12,
    worker_cpu=256,
    worker_mem=1024,
    aws_access_key_id=AWS_KEY,
    aws_secret_access_key=AWS_SECRET,
    image="<dependencies_image>",
    cloudwatch_logs_group="<cloudwatch_group>",
)
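Going back to the first error: the path in the message suggests Dask tried to create its dask-worker-space scratch directory under /var/task, which is read-only in Lambda. A minimal sketch of how the top of my_code.py could deal with that, assuming /tmp is the writable location and that the AWS_KEY / AWS_SECRET values set with ENV in the Dockerfile are read from the environment. `fargate_kwargs` is a hypothetical helper; DASK_TEMPORARY_DIRECTORY is the environment-variable form of Dask's `temporary-directory` setting:

```python
import os

# Dask picks up DASK_* environment variables when it is first imported,
# so this has to run before `import dask` / `import distributed`.
# Assumption: /tmp is the writable scratch location inside Lambda.
os.environ.setdefault("DASK_TEMPORARY_DIRECTORY", "/tmp")


def fargate_kwargs(env=os.environ):
    """Collect the FargateCluster arguments shown above (hypothetical helper)."""
    return {
        "scheduler_mem": 8192,
        "n_workers": 12,
        "worker_cpu": 256,
        "worker_mem": 1024,
        # Set via ENV in the Dockerfile
        "aws_access_key_id": env["AWS_KEY"],
        "aws_secret_access_key": env["AWS_SECRET"],
        "image": "<dependencies_image>",
        "cloudwatch_logs_group": "<cloudwatch_group>",
    }
```

The cluster would then be created with `FargateCluster(**fargate_kwargs())` after importing it from dask-cloudprovider.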

Please let me know if anyone has an idea of what may be going on ! Thanks in advance !

UPDATE

The Dockerfile is finally working and it creates the cluster, but while waiting for the response I get the error below, since the AWS Lambda execution environment has no /dev/shm (shared memory for processes) support:
sl = self._semlock = _multiprocessing.SemLock( future: <Task finished name='Task-2814' coro=<_wrap_awaitable() done, defined at /opt/conda-env/lib/python3.8/asyncio/tasks.py:688> exception=OSError(38, 'Function not implemented')>
I know Dask uses pools for multiprocessing, but is there any way to use multiprocessing.Pipe instead, as explained in Parallel Processing in Python with AWS Lambda | AWS Compute Blog? Thanks in advance !
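For reference, the pattern in that post avoids Pool and Queue and relies only on Process and Pipe. A minimal sketch of it, assuming a Linux environment where the "fork" start method is available; `square_worker` and `parallel_squares` are illustrative names, not taken from Dask or from the blog post:

```python
import multiprocessing as mp

# Assumption: Linux (as in Lambda), where the "fork" start method exists.
_ctx = mp.get_context("fork")


def square_worker(value, conn):
    """Child process: compute a result and send it back through the pipe."""
    conn.send(value * value)
    conn.close()


def parallel_squares(values):
    """Run one process per value and collect the results over Pipe objects."""
    processes, parent_conns = [], []
    for value in values:
        parent_conn, child_conn = _ctx.Pipe()
        proc = _ctx.Process(target=square_worker, args=(value, child_conn))
        proc.start()
        # The child holds its own copy of this end after the fork
        child_conn.close()
        processes.append(proc)
        parent_conns.append(parent_conn)

    # Receive before joining so a child never blocks on an unread pipe
    results = [conn.recv() for conn in parent_conns]
    for proc in processes:
        proc.join()
    return results
```

This only shows the Process + Pipe mechanics; it does not make Dask itself use pipes.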

Hey! Nice to virtually meet you.

Glad you finally got that Dockerfile working. I’ve seen that it’s tricky to run Dask on AWS Lambda in general, see the heading “Reality” in this blog post by @jacobtomlinson : Exploring Dask and Distributed on AWS Lambda | by Jacob Tomlinson | Met Office Informatics Lab | Medium

As for multiprocessing.Pipe, you can write your own functions that use it, but I don’t think Dask can use it directly. I’ll have to look into it more. That said, to get rid of the error, you can try passing processes=False to the Client. Ref: dask worker: daemonic processes are not allowed to have children · Issue #2142 · dask/distributed · GitHub
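Roughly what I mean, as a sketch rather than a tested fix: Client(processes=False) starts a local cluster whose workers are threads inside the current process, so no worker subprocesses are spawned. The handler body below is made up for illustration; only Client(processes=False) is the actual suggestion:

```python
def lambda_handler(event, context):
    # Imported inside the handler to keep this sketch self-contained
    from dask.distributed import Client

    # processes=False: workers run as threads inside this process
    with Client(processes=False) as client:
        # Placeholder workload (hypothetical): sum of squares
        futures = client.map(lambda x: x * x, range(10))
        total = client.submit(sum, futures).result()
    return {"total": total}
```

Leaving the `with` block closes the client and the local cluster it created.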

Are you free sometime this week to go over your workflow a bit more? Sometimes it’s easier to have a quick convo to work on diagnosing why you’re getting the “No such file or directory” error and all that jazz.

If you’re interested, here’s a link to my calendar: Meetings. What time works best for you?

Hello ! Nice to virtually meet you too. Thanks for your answer, it definitely helped me get rid of that error. However, it seems some other errors come up, like TypeError: ‘Serialize’ object is not subscriptable when reading a parquet dataset with Client(processes=False). Do you have any idea about this one? :)

Sure ! I’ll schedule it, thanks !

Investigating now! And looking forward to talking then :)
