Hello
I would like to migrate some local ETL processes from Apache Spark to Dask. To do this, I need to be able to read and write data (in Parquet format) from/to Google Cloud Storage.
Can Dask do this? I have tried without success. Here is one of the code snippets I have been trying:
import dask.dataframe as dd
import gcsfs
# Full path to the Parquet file in GCS
path = 'gs://etl-dask/temporales/test_exportar.parquet'
# Configure gcsfs with the service-account key file
# (gcsfs takes the path to the service-account JSON via the `token` argument)
gcs = gcsfs.GCSFileSystem(
    project='ciencia-de-datos-398421',
    token='Credenciales.json',
    access='full_control',
)
# Read the Parquet file into a Dask DataFrame, passing the same credentials
# through storage_options so Dask can open the gs:// path itself
df_dask = dd.read_parquet(path, storage_options={'token': 'Credenciales.json'})
# Show the first rows (this triggers a computation, since Dask is lazily evaluated)
print(df_dask.head())
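For the write direction, the same storage_options approach should work with to_parquet. A minimal sketch, assuming the same bucket, credentials file, and an illustrative output path (test_exportar_out.parquet is made up here):
import dask.dataframe as dd
# Hypothetical target path in the same bucket
out_path = 'gs://etl-dask/temporales/test_exportar_out.parquet'
# to_parquet accepts the same storage_options dict, so the GCS credentials
# are reused for writing; Dask writes one Parquet file per partition
df_dask.to_parquet(out_path, storage_options={'token': 'Credenciales.json'}, write_index=False)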