Importing nyc-taxi dataset not working

Hi,

I was trying to run the following example code mentioned in the official docs:

ddf = dd.read_parquet(
    "s3://dask-data/nyc-taxi/nyc-2015.parquet/part.*.parquet",
    columns=["passenger_count", "tip_amount"],
    storage_options={"anon": True},
)

However, I get an error: "The following columns were not found in the dataset: tip_amount passenger_count. The following columns were found: Index([], dtype='object')"

Seems like the data cannot be loaded. What is the issue here?

Thanks for any help.

Hi @BAER, welcome to the Dask community,

I just tried the code from this example in my environment, and it worked like a charm. Which Dask version are you using?

Thanks for your fast answer!

I think the problem may be on my side, as the proxy settings might not allow me to access s3://dask-data/nyc-taxi.

Is there a way to check whether the proxy settings are the problem?

The Dask version in my first venv is 2023.6.0, and in the other it is 2023.8.1.

Are you able to execute the following code:

import s3fs
s3 = s3fs.S3FileSystem(anon=True)
s3.ls('dask-data/nyc-taxi/nyc-2015.parquet/')

Thanks for your answer. No, I can’t.

When I run the code I get an error:
SSLError: SSL validation failed for https://dask-data.s3.amazonaws.com/… Cannot connect to host dask-data.s3.amazonaws.com:443 ssl:True [SSLCertVerificationError: (1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate')]

Then, as you said, this is probably a proxy or network configuration problem. You should ask your IT service whether anything can be done about it.
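To confirm that the failure is certificate verification (for example a proxy that re-signs TLS traffic with its own CA) rather than anything in Dask or s3fs, you could run a stdlib-only probe like the sketch below. The helper name and return strings are my own, not part of any library:

```python
import socket
import ssl


def probe_tls(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Attempt a verified TLS handshake; report 'ok' or the failure reason."""
    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return "ok (%s)" % tls.version()
    except ssl.SSLCertVerificationError as exc:
        # A proxy presenting its own certificate chain lands here,
        # matching the "unable to get local issuer certificate" error.
        return "certificate verification failed: %s" % exc.verify_message
    except OSError as exc:
        # DNS failures, blocked ports, timeouts, and other transport errors.
        return "connection failed: %s" % exc


print(probe_tls("dask-data.s3.amazonaws.com"))
```

If this prints a "certificate verification failed" line, the problem is the proxy's certificate chain and not Dask.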

I just double-checked that I can access the path using a random non-Coiled AWS account.
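One more sketch that might help if your IT service can provide the proxy's CA bundle: s3fs forwards `client_kwargs` to the underlying botocore S3 client, whose `verify` option accepts a path to a CA bundle. The bundle path below is hypothetical, so substitute whatever your IT service gives you:

```python
# Sketch: route a corporate CA bundle to botocore via Dask's storage_options.
# The bundle path below is hypothetical -- ask your IT service for the real one.
storage_options = {
    "anon": True,
    # s3fs passes client_kwargs through to the botocore S3 client;
    # "verify" points certificate verification at the given CA bundle.
    "client_kwargs": {"verify": "/etc/ssl/certs/corporate-ca.pem"},
}

# Usage (same call as in the original question, network access permitting):
# import dask.dataframe as dd
# ddf = dd.read_parquet(
#     "s3://dask-data/nyc-taxi/nyc-2015.parquet/part.*.parquet",
#     columns=["passenger_count", "tip_amount"],
#     storage_options=storage_options,
# )
```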