After a fair amount of wrestling (it turns out Docker is broken with the current default AMI on AWS), we got an EC2 cluster up and running. We are trying to use the default Docker image, but a basic read_parquet() call results in:
File /opt/conda/lib/python3.8/site-packages/distributed/protocol/pickle.py:66, in loads()
File /opt/conda/lib/python3.8/site-packages/dask/dataframe/io/parquet/arrow.py:7, in <module>
ModuleNotFoundError: No module named 'pyarrow'
Is pyarrow really missing from the default Docker image (which seems very hard to believe), or am I misunderstanding something fundamental about how this works? We did get the following warning:
/home/martin/.conda/envs/martin/lib/python3.9/site-packages/distributed/client.py:1278: VersionMismatchWarning: Mismatched versions found
+-------------+-----------+-----------+---------+
| Package     | client    | scheduler | workers |
+-------------+-----------+-----------+---------+
| cloudpickle | 2.0.0     | 2.1.0     | None    |
| dask        | 2022.02.1 | 2022.05.2 | None    |
| distributed | 2022.2.1  | 2022.5.2  | None    |
| lz4         | None      | 4.0.0     | None    |
+-------------+-----------+-----------+---------+
warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))
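For reference, here is a minimal sketch of how one might check where pyarrow is actually importable. The helper names `has_module` and `report_module` are hypothetical, and `client` is assumed to be a `distributed.Client` already connected to the cluster. Since the paths above show Python 3.9 on the client and 3.8 in the image, shipping a locally defined function to the cluster may itself fail, so treat this as a diagnostic idea rather than a guaranteed recipe.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the top-level module `name` can be found in this environment."""
    return importlib.util.find_spec(name) is not None


def report_module(client, name: str = "pyarrow") -> dict:
    """Report whether `name` is importable on the client, scheduler, and workers.

    Assumption: `client` is a distributed.Client connected to the cluster.
    """
    return {
        "client": has_module(name),
        # Runs has_module inside the scheduler process.
        "scheduler": client.run_on_scheduler(has_module, name),
        # Runs has_module on every connected worker; the result is a dict
        # keyed by worker address (empty if no workers are connected).
        "workers": client.run(has_module, name),
    }
```

Calling `report_module(client)` right after connecting should show which side is missing the package. The `workers` column in the warning reads `None`, which might mean no workers had registered yet when the version check ran; in that case the `workers` entry here would simply come back empty.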