Missing pyarrow

After a fair amount of wrestling (turns out docker is broken with current default AMI image on AWS), we got an EC2 cluster up and running. We're trying to use the default Docker image, but a basic read_parquet() call results in:

File /opt/conda/lib/python3.8/site-packages/distributed/protocol/pickle.py:66, in loads()
File /opt/conda/lib/python3.8/site-packages/dask/dataframe/io/parquet/arrow.py:7, in <module>
ModuleNotFoundError: No module named 'pyarrow'

Is pyarrow really not in the default docker image (seems very hard to believe), or am I misunderstanding something fundamental about how this works? We did get the following warning:

/home/martin/.conda/envs/martin/lib/python3.9/site-packages/distributed/client.py:1278: VersionMismatchWarning: Mismatched versions found

+-------------+-----------+-----------+---------+
| Package     | client    | scheduler | workers |
+-------------+-----------+-----------+---------+
| cloudpickle | 2.0.0     | 2.1.0     | None    |
| dask        | 2022.02.1 | 2022.05.2 | None    |
| distributed | 2022.2.1  | 2022.5.2  | None    |
| lz4         | None      | 4.0.0     | None    |
+-------------+-----------+-----------+---------+
  warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))

@martin Welcome!

It looks like pyarrow really isn't in the default image, though I agree that it should be.
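If you want to confirm this from the client side, something like the sketch below may help. The `has_module` helper is a name I made up for illustration; `Client.run` is the existing distributed call that runs a function on every worker and returns a dict keyed by worker address. The last lines assume you already have a connected `client`, so they are left commented out.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module called `name` can be found.

    Hypothetical helper for illustration. It only looks the module up
    and does not import it.
    """
    return importlib.util.find_spec(name) is not None


# Local sanity check: the standard library is always there.
print(has_module("json"))

# On a running cluster (assumes an existing distributed `client`):
# results = client.run(has_module, "pyarrow")
# for worker_address, found in results.items():
#     print(worker_address, found)
```

Any worker that reports False is missing the package in its environment.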

You can use the $EXTRA_CONDA_PACKAGES and $EXTRA_PIP_PACKAGES environment variables to include it for now.
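Here is a rough sketch of how that might look. I'm assuming the cluster was created with dask-cloudprovider's `EC2Cluster` and that its `env_vars` argument reaches the container. The `extra_package_env` and `start_cluster` helpers are names I made up for illustration. The dask-docker image reads both variables as space-separated package lists.

```python
def extra_package_env(conda_packages=(), pip_packages=()):
    """Build the env vars read by the dask-docker image at startup.

    Hypothetical helper for illustration. Each variable holds a
    space-separated list of package names.
    """
    env = {}
    if conda_packages:
        env["EXTRA_CONDA_PACKAGES"] = " ".join(conda_packages)
    if pip_packages:
        env["EXTRA_PIP_PACKAGES"] = " ".join(pip_packages)
    return env


env_vars = extra_package_env(conda_packages=["pyarrow"])
print(env_vars)


def start_cluster(env_vars):
    """Start an EC2 cluster whose containers install the extra packages.

    Assumes dask-cloudprovider is installed and AWS credentials are
    configured. Not called here because it launches real instances.
    """
    from dask.distributed import Client
    from dask_cloudprovider.aws import EC2Cluster

    cluster = EC2Cluster(env_vars=env_vars)
    return Client(cluster)
```

The packages are installed each time a container starts, so startup will take a bit longer. If that becomes a problem, building your own image with pyarrow baked in is probably the better long-term fix.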

> turns out docker is broken with current default AMI image on AWS

Thanks for sharing! I'd encourage you to open an issue about both this and the missing pyarrow in the dask/dask-docker repository on GitHub (Docker images for dask).