Creating multiple columns from a rolling window on a single column

carl-krikorian · January 29, 2024, 11:42am

Hello,
I have a very large dataset of DateTimeIndexed signals that cannot be loaded in memory without using dask.

I am trying to do some feature engineering to derive multiple features from a single column, using a rolling window of about "100ms" as well as other time frames. The features are based on a fast Fourier transform, and the original data is at 20ms resolution.
I have looked into rolling, but my understanding is that the function passed to rolling.apply must return a single value, so it is not usable here.

A solution I found in pandas was to use resample and iterate over the bins generated to create a list of my different features (there is an equivalent in dask here) like so:

f1s, f2s = [], []
bins = result.resample("100ms")
# iterating a pandas resampler yields (bin start, rows in that bin) pairs
for bin_start, chunk in bins:
    # compute new features from column "X" of this bin
    f1 = ...
    f2 = ...
    f1s.append(f1)
    f2s.append(f2)

However, resample in dask doesn't seem to be iterable, so this solution is not applicable either.
I would appreciate any advice on how best to solve my problem.
Thank you for your time.

guillaumeeb · January 31, 2024, 3:18pm

Hi @carl-krikorian, welcome to the Dask Discourse forum!

First, could you clarify what you mean by rolling window? Is it rolling in the sense that the step can differ from the window size, or does each original value fall in only one window?

You talk about using resample: do you need the original values or the resampled ones?

Could you provide a reproducible example?

One easy solution, if your dataset is already sorted by time, could be to use map_partitions and write the per-partition code with Pandas.
