Hi. In my Dask dataframe transformation pipeline, I'm trying to verify that the data in a dask.dataframe satisfies some assertion.
What is the best practice for validating/asserting one's dataframe, such that both the data validation and the dataframe transformation are performed in the same .compute() (i.e. in one parallel job)?
My example is:
import dask.dataframe as ddf
ddf_data = ddf.read_csv(...)
list_of_expected_str = ['a', 'b']
unit_map = {'a': 1, 'b': 2}
assert ddf_data['col_units'].isin(list(unit_map.keys())).all()
ddf_data['col_values'].mul(ddf_data['col_units'].map(unit_map))