Hi, I have a data lake whose data is stored on S3.
Python version: 3.8
dask version: "dask[complete]"==2023.5.0
I have to read this data from S3 and do certain aggregations. I cannot access S3 directly; I have to assume an IAM role. For the sake of simplicity I have attached the S3FullAccess policy to the role that I assume in my code.
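For context, this is roughly how I get the temporary credentials (a sketch: the role ARN and session name are placeholders, and `creds_to_storage_options` is just a helper I use to rename boto3's credential fields to the keyword names that s3fs/fsspec expect):

```python
def creds_to_storage_options(creds):
    """Map the 'Credentials' dict returned by sts.assume_role to
    the keyword names understood by s3fs (key/secret/token)."""
    return {
        "key": creds["AccessKeyId"],
        "secret": creds["SecretAccessKey"],
        "token": creds["SessionToken"],
    }

# Usage (requires boto3, network access, and a role you may assume;
# the ARN and session name below are placeholders):
# import boto3
# sts = boto3.client("sts")
# resp = sts.assume_role(RoleArn="arn:aws:iam::123456789012:role/my-role",
#                        RoleSessionName="dask-session")
# s3_opts = creds_to_storage_options(resp["Credentials"])
```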
As far as I know, Dask DataFrame cannot load Avro data, so we need to use a Dask Bag.
I am facing a weird issue: when I try to access the Avro files through a Dask Bag, I get Access Denied on the ListObjectsV2 operation. However, when I use a DataFrame to load other file formats from S3, such as Parquet, ORC, or CSV, I see no error and the program loads the data correctly.
Here is the code that runs properly:
import dask.dataframe as dd
import os
ACCESS_KEY='MY_KEY'
SECRET_KEY='KEY_SEC_KEY'
SESSION_TOKEN='MY_LONG_TOKEN/Gm+YYzuj46'
path = "s3://prod-v2-datalake/dc-dataload/data/warehouse/tablespace/external/hive/us_customers_icebeg/data/*"
os.environ["AWS_ACCESS_KEY_ID"]=ACCESS_KEY
os.environ["AWS_SECRET_ACCESS_KEY"]=SECRET_KEY
os.environ["AWS_SESSION_TOKEN"]=SESSION_TOKEN
os.environ["AWS_DEFAULT_REGION"] = "us-west-2"
os.environ["AWS_REGION"] = "us-west-2"
df = dd.read_parquet(path)
print(df.head())  # head() triggers the computation
The DataFrame loads the data correctly.
However, I see the error with the following program:
import os
import dask.bag as dd
ACCESS_KEY='MY_KEY'
SECRET_KEY='KEY_SEC_KEY'
SESSION_TOKEN='MY_LONG_TOKEN/Gm+YYzuj46'
path = "s3a://prod-v2-datalake/dc-dataload/data/warehouse/tablespace/external/hive/us_customers_avro/000000_0"
os.environ["AWS_ACCESS_KEY_ID"]=ACCESS_KEY
os.environ["AWS_SECRET_ACCESS_KEY"]=SECRET_KEY
os.environ["AWS_SESSION_TOKEN"]=SESSION_TOKEN
os.environ["AWS_DEFAULT_REGION"] = "us-west-2"
os.environ["AWS_REGION"] = "us-west-2"
s3_opts = {'anon': True, 'use_ssl': False, 'key': ACCESS_KEY, 'secret':SECRET_KEY, 'token': SESSION_TOKEN}
df = dd.read_avro(path, storage_options=s3_opts)
print(df.take(5))  # a Dask Bag has no head(); take() triggers the computation
Error:
Traceback (most recent call last):
File "/home/ec2-user/.local/share/virtualenvs/profiler-9--y97BQ/lib/python3.8/site-packages/s3fs/core.py", line 720, in _lsdir
async for c in self._iterdir(
File "/home/ec2-user/.local/share/virtualenvs/profiler-9--y97BQ/lib/python3.8/site-packages/s3fs/core.py", line 770, in _iterdir
async for i in it:
File "/home/ec2-user/.local/share/virtualenvs/profiler-9--y97BQ/lib/python3.8/site-packages/aiobotocore/paginate.py", line 30, in __anext__
response = await self._make_request(current_kwargs)
File "/home/ec2-user/.local/share/virtualenvs/profiler-9--y97BQ/lib/python3.8/site-packages/aiobotocore/client.py", line 408, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "avro_test.py", line 25, in <module>
df = dd.read_avro(path, storage_options=s3_opts)
File "/home/ec2-user/.local/share/virtualenvs/profiler-9--y97BQ/lib/python3.8/site-packages/dask/bag/avro.py", line 102, in read_avro
fs, fs_token, paths = get_fs_token_paths(
File "/home/ec2-user/.local/share/virtualenvs/profiler-9--y97BQ/lib/python3.8/site-packages/fsspec/core.py", line 657, in get_fs_token_paths
paths = [f for f in sorted(fs.glob(paths)) if not fs.isdir(f)]
File "/home/ec2-user/.local/share/virtualenvs/profiler-9--y97BQ/lib/python3.8/site-packages/fsspec/asyn.py", line 118, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/home/ec2-user/.local/share/virtualenvs/profiler-9--y97BQ/lib/python3.8/site-packages/fsspec/asyn.py", line 103, in sync
raise return_result
File "/home/ec2-user/.local/share/virtualenvs/profiler-9--y97BQ/lib/python3.8/site-packages/fsspec/asyn.py", line 56, in _runner
result[0] = await coro
File "/home/ec2-user/.local/share/virtualenvs/profiler-9--y97BQ/lib/python3.8/site-packages/s3fs/core.py", line 799, in _glob
return await super()._glob(path, **kwargs)
File "/home/ec2-user/.local/share/virtualenvs/profiler-9--y97BQ/lib/python3.8/site-packages/fsspec/asyn.py", line 804, in _glob
allpaths = await self._find(
File "/home/ec2-user/.local/share/virtualenvs/profiler-9--y97BQ/lib/python3.8/site-packages/s3fs/core.py", line 829, in _find
return await super()._find(
File "/home/ec2-user/.local/share/virtualenvs/profiler-9--y97BQ/lib/python3.8/site-packages/fsspec/asyn.py", line 846, in _find
if withdirs and path != "" and await self._isdir(path):
File "/home/ec2-user/.local/share/virtualenvs/profiler-9--y97BQ/lib/python3.8/site-packages/s3fs/core.py", line 1480, in _isdir
return bool(await self._lsdir(path))
File "/home/ec2-user/.local/share/virtualenvs/profiler-9--y97BQ/lib/python3.8/site-packages/s3fs/core.py", line 733, in _lsdir
raise translate_boto_error(e)
PermissionError: Access Denied
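To separate a dask problem from an s3fs/credential one, here is a small diagnostic I could run (a sketch; it assumes s3fs is importable, and the bucket/prefix are copied from the failing program). The helper deliberately omits the `anon` key, because in s3fs `anon=True` requests unsigned (anonymous) access, under which `key`/`secret`/`token` are not used:

```python
def signed_s3_options(key, secret, token):
    """Build s3fs keyword arguments for authenticated (signed) access.

    No 'anon' key on purpose: anon=True makes s3fs send unsigned
    (anonymous) requests and ignore the credentials entirely.
    """
    return {"key": key, "secret": secret, "token": token}

# Usage (needs network access and valid temporary credentials):
# import s3fs
# fs = s3fs.S3FileSystem(**signed_s3_options(ACCESS_KEY, SECRET_KEY, SESSION_TOKEN))
# print(fs.ls("prod-v2-datalake/dc-dataload/data/warehouse/tablespace/"
#             "external/hive/us_customers_avro/"))
```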
I have verified the permissions on the AWS IAM role. It cannot be a role issue, since we can read the Parquet and ORC files properly.
Can I get some direction on how to load the Avro files? I appreciate your time and attention in advance.