Cloud Storage and Dask

Hello

I would like to migrate some local ETL processes from Apache Spark to Dask. To do this, I need to be able to read and write data from/to Google Cloud Storage (in Parquet format).

Can Dask do this? I have tried without success. Here is one of the code snippets I have been trying:

from google.oauth2.credentials import Credentials
from google.oauth2 import service_account
import gcsfs
from google.cloud import storage
import dask.dataframe as dd

# Full path to the Parquet file on GCS
path = 'gs://etl-dask/temporales/test_exportar.parquet'

# Configure gcsfs with the service account credentials
gcs = gcsfs.GCSFileSystem(token='google-oauth', project='ciencia-de-datos-398421',
                          keyfile_path='Credenciales.json', access="full_control")
gcs

# Read the Parquet file into a Dask DataFrame
df_dask = dd.read_parquet(path=path)

# Look at the first few rows (this triggers a computation, since Dask is lazily evaluated)
print(df_dask.head())

Hi @orlandombaa, welcome to the Dask community!

Yes, Dask can read and write to GCS without any problem using gcsfs.

What error do you get? You should probably use the storage_options kwarg to pass your credentials to the read_parquet call.
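
As a minimal sketch (assuming the service account key in Credenciales.json has access to your bucket; the output path in the write step is just an example):

import dask.dataframe as dd

path = 'gs://etl-dask/temporales/test_exportar.parquet'

# gcsfs accepts a path to a service account JSON key file as the token,
# so you can hand it straight to Dask through storage_options.
df_dask = dd.read_parquet(path, storage_options={'token': 'Credenciales.json'})
print(df_dask.head())

# Writing back to GCS works the same way (hypothetical output path).
df_dask.to_parquet('gs://etl-dask/temporales/test_exportar_out.parquet',
                   storage_options={'token': 'Credenciales.json'})

With this approach you don't need to build the GCSFileSystem object yourself: Dask infers the filesystem from the gs:// prefix and creates it with the storage_options you provide.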