Cloud Storage and Dask

Hello

I would like to migrate some local ETL processes from Apache Spark to Dask. To do this, I need to be able to read and write data from/to Google Cloud Storage (in Parquet format).

Can Dask do this? I have tried without success. Here is one of the code snippets I have been trying:

from google.oauth2.credentials import Credentials
from google.oauth2 import service_account
import gcsfs
from google.cloud import storage
import dask.dataframe as dd

# Full path to the Parquet file on GCS
path = 'gs://etl-dask/temporales/test_exportar.parquet'

# Configure gcsfs with the service account credentials
gcs = gcsfs.GCSFileSystem(token='google-oauth', project='ciencia-de-datos-398421',
                          keyfile_path='Credenciales.json', access="full_control")
gcs

# Read the Parquet file into a Dask DataFrame
df_dask = dd.read_parquet(path=path)

# Look at the first few rows (this triggers a computation, since Dask is lazily evaluated)
print(df_dask.head())

Hi @orlandombaa, welcome to the Dask community!

Yes, Dask can read and write to GCS without any problem using gcsfs.

What error do you get? You should probably use the storage_options kwarg to pass your credentials to the read_parquet call.
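
As a minimal sketch (assuming the service account key in Credenciales.json has access to your bucket; the output path in the write step is just an example):

import dask.dataframe as dd

path = 'gs://etl-dask/temporales/test_exportar.parquet'

# gcsfs accepts a path to a service account JSON key file as the token,
# so you can hand it straight to Dask through storage_options.
df_dask = dd.read_parquet(path, storage_options={'token': 'Credenciales.json'})
print(df_dask.head())

# Writing back to GCS works the same way (hypothetical output path).
df_dask.to_parquet('gs://etl-dask/temporales/test_exportar_out.parquet',
                   storage_options={'token': 'Credenciales.json'})

With this approach you don't need to build the GCSFileSystem object yourself: Dask infers the filesystem from the gs:// prefix and creates it with the storage_options you provide.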