Hello All,
I have a Dask DataFrame with 33 partitions.
I am trying to keep only the columns from a given list, drop everything else, and save the result to a Parquet file.
Here is the code I am trying:
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

TARGETED_COLS = ['col1', 'col2', 'col3']

# Local cluster with 4 single-threaded workers
cluster = LocalCluster(
    n_workers=4,
    threads_per_worker=1,
    timeout="60s",
)
client = Client(cluster)

file_path = "data/my_large_json_lines_file"

# Read the JSON-lines file in roughly 128 MiB blocks
ddf = dd.read_json(file_path, lines=True, blocksize="128 MiB")

def process(df):
    # Keep only the targeted columns in each partition
    return df[TARGETED_COLS]

ddf.map_partitions(process).to_parquet("test.parquet")
When I run it, I get the following error:
ValueError: Metadata mismatch found in `from_delayed`.
Partition type: `pandas.core.frame.DataFrame`
+--------------------------+---------+----------+
| Column                   | Found   | Expected |
+--------------------------+---------+----------+
| 'demo_col1'              | float64 | -        |
| 'demo_col2'              | object  | -        |
| 'demo_col3'              | object  | -        |
+--------------------------+---------+----------+
Can someone please advise what I can do to fix this?
Note: I do not know the initial columns in advance; only the targeted columns are known.
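For context, one workaround I am considering (just a sketch, I am not sure it is correct) is to pass an explicit meta to map_partitions and reindex each partition against TARGETED_COLS, so every partition comes out with the same schema even if some input columns are missing. The object dtype in the meta is a guess on my part:

import pandas as pd

# Sketch: give Dask an explicit output schema instead of letting it infer
# one, and reindex so a partition missing a targeted column still conforms.
# Assumes object dtype is acceptable for every targeted column.
meta = pd.DataFrame({col: pd.Series(dtype="object") for col in TARGETED_COLS})

def process_fixed(df):
    # reindex adds any missing targeted columns as NaN and drops the rest
    return df.reindex(columns=TARGETED_COLS)

ddf.map_partitions(process_fixed, meta=meta).to_parquet("test.parquet")

But I do not know whether this is the idiomatic approach, or whether the mismatch actually comes from read_json itself rather than from my map_partitions call.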
Thank you!