The program takes a .csv database, performs computations on it, and then needs to save the result so that it can be read with dask.dataframe.
When the saved file is read back in Python, the column types that the dataframe had in the loop should be preserved.
I assume I need to use csv files plus a separate configuration file that specifies the column types.
Another question: how can I read a large file into one dataframe?
@PROehidna Hi and welcome to Discourse!
If you want to store the data in csv format, you’re right that you need to store the column types separately. You can use ddf.dtypes to get the types. That said, if each column has consistent types, Dask can usually guess the dtypes correctly (it uses the first few values to guess the column types).
If the file format is not a limitation, you can consider using parquet, which will preserve dtypes and is generally more efficient.
Another question: how can I read a large file into one dataframe?
You can use dd.read_csv(), which will load the file into a single Dask DataFrame by default. More details are in the "Create and Store Dask DataFrames" page of the Dask documentation.
Let me know if this helps, and if you’re stuck at something specific, it’ll be super helpful if you can share a minimal, reproducible example! 
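# dfSQL yields chunks read from the database; Calculate processes one chunk.
for i, chunk in enumerate(dfSQL):
    print("Reading a Block of Data...")
    res = Calculate(chunk, ExRates, log)
    # Wrap the processed pandas chunk in a Dask DataFrame and append it to the output.
    df = dd.from_pandas(res, npartitions=3)
    df.to_csv(filename, index=False, mode='a', compression="gzip")
    pbar.update()
pbar.close()
log.close()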
In this loop, data is read from the database in chunks, the computations are performed, and the result is stored using:
df.to_csv(filename, index=False, mode='a', compression="gzip")
After that, I check that the resulting file is readable, using the following code:
df = dd.read_csv("./genfiles/13_Apr_2022_17_18_04.gz")
df1 = df.compute()
df2 = (df['VAL_RUB'].sum()).compute() # Trying to sum a column
df_len = len(df)
print(df.head(10))
print("stop")
But the program gives an error