How to save a dataset so that it can be read back into a Dask DataFrame

The program takes some .csv data, performs computations on it, and then needs to save the result so that it can be read back with dask.dataframe.
When the saved file is read back in Python, the column types that the dataframe had in the loop should be preserved.
I assume I need to use csv files plus a separate configuration file that specifies the column types.

Another question, how can I read a large file in one dataframe?

@PROehidna Hi and welcome to Discourse!

If you want to store the data in csv format, you’re right that you need to store the column types separately, because csv itself carries no type information. You can use ddf.dtypes to get the types. That said, if each column has consistent types, Dask can usually infer the dtypes correctly (it infers them from a sample taken at the start of the file).

If the file format is not a limitation, you can consider using parquet, which stores the dtypes together with the data and is generally more efficient.

Another question, how can I read a large file in one dataframe?

You can use dd.read_csv(), which will load the file into a single Dask DataFrame by default. More details in the documentation here: Create and Store Dask DataFrames (Dask documentation).

Let me know if this helps, and if you’re stuck at something specific, it’ll be super helpful if you can share a minimal, reproducible example! :smile:

for i, chunk in enumerate(dfSQL):
    print("Reading a Block of Data...")
    res = Calculate(chunk, ExRates, log)
    df = dd.from_pandas(res, npartitions=3)
    df.to_csv(filename, index=False, mode='a', compression="gzip")
    pbar.update()

pbar.close()
log.close()

In this loop, data is read from the database in chunks, the computations are performed, and each result is saved with:
df.to_csv(filename, index=False, mode='a', compression="gzip")

After that, I need to check that the resulting file is readable. I do this with the following code:

df = dd.read_csv("./genfiles/13_Apr_2022_17_18_04.gz")
df1 = df.compute()
df2 = (df['VAL_RUB'].sum()).compute() # Trying to sum a column
df_len = len(df)
print(df.head(10))
print("stop")

But the program raises an error.