How to save the database so that it is readable for the dataframe

PROehidna · March 22, 2022, 1:25pm

The program takes some .csv databases, performs computations on them, and then needs to save the resulting database so that it can be read with Dask DataFrame.
When the saved file is read back in Python, the column types that the dataframe had in the loop should be preserved.
I assume I need to use csv files plus a separate configuration file that specifies the column types.

Another question: how can I read a large file into one dataframe?

pavithraes · March 22, 2022, 3:43pm

@PROehidna Hi and welcome to Discourse!

If you want to store the data in csv format, you’re right that you need to store the column types separately. You can use ddf.dtypes to get the types. That said, if each column has consistent types, Dask can usually guess the dtypes correctly (it infers them from a sample at the start of the file, so a column whose later values do not match that sample needs an explicit dtype).
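As a rough illustration of the "CSV plus a separate types file" idea, here is a minimal sketch. It uses plain pandas to keep it short; dd.read_csv() accepts the same dtype= mapping. The helper names, file names, and sample columns (other than VAL_RUB) are made up for the example, and dtypes that read_csv cannot take directly (datetimes, for instance) would need extra handling such as parse_dates.

```python
import json
import os
import tempfile

import pandas as pd


def save_with_dtypes(df, csv_path, dtypes_path):
    """Write df to CSV and its column dtypes to a JSON sidecar file."""
    df.to_csv(csv_path, index=False)
    with open(dtypes_path, "w") as fh:
        json.dump({col: str(dtype) for col, dtype in df.dtypes.items()}, fh)


def load_with_dtypes(csv_path, dtypes_path):
    """Read the CSV back, forcing the dtypes recorded in the sidecar file."""
    with open(dtypes_path) as fh:
        dtypes = json.load(fh)
    return pd.read_csv(csv_path, dtype=dtypes)


# Made-up sample data: "CODE" looks numeric but must stay a string.
original = pd.DataFrame(
    {
        "CODE": ["007", "042"],
        "QTY": pd.Series([1, 2], dtype="int32"),
        "VAL_RUB": [1.5, 2.25],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "result.csv")
    dtypes_path = os.path.join(tmp, "result.dtypes.json")
    save_with_dtypes(original, csv_path, dtypes_path)
    guessed = pd.read_csv(csv_path)  # dtypes inferred from the data
    restored = load_with_dtypes(csv_path, dtypes_path)  # dtypes from the sidecar

print(guessed.dtypes)
print(restored.dtypes)
```

Without the sidecar file, the CODE column is inferred as an integer and loses its leading zeros; with it, the column comes back as text.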

If the file format is not a limitation, you can consider using parquet which will preserve dtypes and is generally more efficient.

> Another question: how can I read a large file into one dataframe?

You can use dd.read_csv(), which will load the file in a single Dask DataFrame by default. More details in the documentation here: Create and Store Dask DataFrames — Dask documentation.

Let me know if this helps, and if you’re stuck at something specific, it’ll be super helpful if you can share a minimal, reproducible example!

PROehidna · April 14, 2022, 3:14pm

for i, chunk in enumerate(dfSQL):
    print("Reading a Block of Data...")
    res = Calculate(chunk, ExRates, log)
    df = dd.from_pandas(res, npartitions=3)
    df.to_csv(filename, index=False, mode='a', compression="gzip")
    pbar.update()
pbar.close()
log.close()

In this loop, data is read from the database in chunks, computations are performed on each chunk, and the result is saved with:
df.to_csv(filename, index=False, mode='a', compression="gzip")

After that, I need to check that the resulting file is readable. I do this with the following code:

df = dd.read_csv("./genfiles/13_Apr_2022_17_18_04.gz")
df1 = df.compute()
df2 = (df['VAL_RUB'].sum()).compute()  # Trying to sum a column
df_len = len(df)
print(df.head(10))
print("stop")

However, the program raises an error.
